
Integer factorisation: what’s the problem?

The problem of factoring is a very well studied problem about which we know surprisingly little. It is safe to assume that this problem, which we will refer to as $\mathrm{FACTOR}$, has been studied for longer than the problem of merely verifying primality. Yet, even though the problem is well known and relatively simple to state, we have yet to solve it.

But, what do I mean by “solve” $\mathrm{FACTOR}$?

A little on complexity

Definition. A problem $P$ can be solved in $T(n)$ time, where $n$ is the size of the input, if there is some Turing machine $M$ which solves $P$ in at most $c \cdot T(n)$ steps, for some constant $c \in \mathbb{R}^+$.

In layman's terms, our algorithm runs in $T(n)$ time if it does not take more than (a constant multiple of) $T(n)$ steps to complete. The nitpick here is how the input is given to us.

As in basically every problem in computer science, in the problem of factoring our input is the number $n$ represented in binary. This means that the size of our input is actually $\Theta(\log n)$, and not $n$ itself. As it turns out, this makes things much harder to solve.
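
To make the distinction concrete, here is a tiny Python sketch (my own illustration, not part of the original argument): the input size is the bit length of $n$, so even a loop that merely counts up to $n$ already takes a number of steps exponential in the input size.

```python
n = 1_000_003
bits = n.bit_length()        # input size: ~log2(n) bits (20 here)

# Counting up to n takes about 2**bits steps: polynomial in the
# *value* of n, but exponential in the *size* of the input.
print(f"n = {n}, input size = {bits} bits, 2**bits = {2**bits}")
```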

The $\mathrm{PRIME}$ problem

Before we look at $\mathrm{FACTOR}$, let us take a look at a simpler problem: deciding whether a given number is prime or not. In 2002, a paper published by three Indian computer scientists (initials AKS) showed that, given a number $n$, we can indeed check whether it is prime in $O(\log^k n)$ time, for some $k > 0$. While the algorithm itself is quite complex, we will look at a simpler variant which achieves a similar running time, but is probabilistic.

The Miller-Rabin test checks for primality following a simple argument: if we pick a random $y \in \mathbb{Z}^*_n$ and $n$ is composite, then with probability at least $\frac{3}{4}$ the number $y$ is a witness to that fact. In the simplest version, if $y^{n-1} \not\equiv 1 \pmod n$, then by Fermat's little theorem we can conclude that $n$ is not prime; the full test also inspects the square roots of $1$ encountered along the way, which is what makes the $\frac{3}{4}$ bound hold for every composite $n$. This test, while simple, achieves excellent running times on real-world computers, as each pass carries very little overhead. The AKS algorithm, on the other hand, is much slower in practice due to edge-case treatment and various overheads. Still, it was a revolution in computer science and mathematics alike.
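
To make this concrete, here is a minimal Python sketch of the strong probable-prime (Miller-Rabin) test; the function name and the fixed round count are my own choices, not something fixed by the discussion above.

```python
import random

def is_probable_prime(n: int, rounds: int = 20) -> bool:
    """Miller-Rabin: each round catches a composite n with probability >= 3/4."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13):
        if n % p == 0:
            return n == p
    # Write n - 1 = 2^s * d with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        y = random.randrange(2, n - 1)
        x = pow(y, d, n)                  # y^d mod n
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False                  # y witnesses that n is composite
    return True                           # probably prime

print(is_probable_prime(2**61 - 1))       # True: a Mersenne prime
print(is_probable_prime(2**61 + 1))       # False: divisible by 3
```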

The $\mathrm{FACTOR}$ problem

The $\mathrm{FACTOR}$ problem is a simple problem where, given an integer $n$, we are asked to return a list of factors such that $n = \prod f_i$. More formally, we are asked to return a list of primes, along with their multiplicities, such that $n = \prod {p_i}^{m_i}$. Note that the length of this list is easily bounded by $O(\log n)$, so the main difficulty is actually finding the factors, rather than writing down the (short) output.
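
As a baseline (a sketch of the naive approach, not an algorithm endorsed by the text), trial division returns exactly such a list of $(p_i, m_i)$ pairs, but needs roughly $\sqrt{n}$ steps, i.e. time exponential in the bit length of $n$:

```python
def factor(n: int) -> list[tuple[int, int]]:
    """Return [(p_1, m_1), (p_2, m_2), ...] with n = prod(p_i ** m_i)."""
    factors = []
    p = 2
    while p * p <= n:            # roughly sqrt(n) iterations in the worst case
        if n % p == 0:
            m = 0
            while n % p == 0:
                n //= p
                m += 1
            factors.append((p, m))
        p += 1
    if n > 1:                    # whatever remains is itself prime
        factors.append((n, 1))
    return factors

print(factor(360))               # [(2, 3), (3, 2), (5, 1)]
```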

Today, it is not even known how hard $\mathrm{FACTOR}$ really is. That is, we do not currently know whether $\mathrm{FACTOR} \in \mathsf{P}$, nor whether it is $\mathsf{NP}$-complete. If it lies in neither of those classes (and assuming $\mathsf{P} \neq \mathsf{NP}$), then we'd have $\mathrm{FACTOR} \in \mathsf{NP}\text{-intermediate}$. Unfortunately, this would not help much either, as problems such as graph isomorphism and discrete logarithm are also suspected to reside in that class.

In fact, some encryption protocols (e.g., RSA) depend on the assumption that integer factorisation is a hard problem. So it seems unlikely that $\mathrm{FACTOR}$ is in $\mathsf{P}$; or rather, there is great interest in cracking this problem, yet nobody has managed to do so to date.

Current State of the Art

If we stick with the idea that $\mathrm{FACTOR} \notin \mathsf{P}$ (a decent assumption), but still wish to improve our runtime, then we are left with a new issue (besides finding a new algorithm): what running time should we even aim for? Our task is to find some function $t$ that satisfies
$$\log^k n < t(n) < n^\varepsilon$$
for any $k \in \mathbb{N}$ and $\varepsilon \in \mathbb{R}^+$. In day-to-day tasks, such functions $t$ are unheard of! Do they even exist? Well, they do, and this can be seen by taking $\log$ of all sides:
$$k \log \log n < \log t(n) < \varepsilon \log n$$
The inequalities are preserved since $\log$ is monotonically increasing. Letting $g(n) = \log t(n)$, we see that $g(n)$ must grow faster than any constant multiple of $\log \log n$, yet slower than any constant multiple of $\log n$; for instance, $g(n)$ can be $\log n$ raised to a power less than $1$, possibly times some power of $\log \log n$. In other words, we can take
$$t(n) = \exp\left(c \cdot (\log n)^\alpha \cdot (\log \log n)^\beta\right)$$
for $0 < \alpha < 1$ and $\beta > 0$.

As it turns out (shocker!) these functions do have a name, and are denoted using $L$-notation. Given some $n$, and $0 \le \alpha \le 1$, we define:
$$L_n[\alpha, c] := \exp\left((c + o(1)) \cdot (\log n)^\alpha (\log \log n)^{1-\alpha}\right)$$
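
To get a feel for where $L_n[\alpha, c]$ sits between polynomial in $\log n$ and polynomial in $n$, here is a small Python sketch evaluating these growth rates for a 1024-bit number (the $o(1)$ term is dropped, and the constant $1.923 \approx (64/9)^{1/3}$ is the one usually quoted for GNFS):

```python
import math

def L(ln_n: float, alpha: float, c: float) -> float:
    """L_n[alpha, c] with the o(1) term dropped, given ln(n)."""
    return math.exp(c * ln_n ** alpha * math.log(ln_n) ** (1 - alpha))

ln_n = 1024 * math.log(2)          # ln(n) for a 1024-bit modulus, ~710

print(ln_n ** 3)                   # poly(log n):            ~3.6e8
print(L(ln_n, 1 / 3, 1.923))       # L_n[1/3, ~1.92] (GNFS): ~1.3e26
print(L(ln_n, 1 / 2, 1.0))         # L_n[1/2, 1]:            ~4.4e29
print(math.exp(0.1 * ln_n))        # n**0.1, truly poly(n):  ~6.7e30
```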

Today, the asymptotically fastest known general-purpose algorithm is the general number field sieve (GNFS), which runs in $L_n[\alpha, c]$ time with $\alpha = \frac{1}{3}$. Additionally, there is a probabilistic algorithm (SSL) which runs with $\alpha = \frac{1}{2}$, albeit with a much smaller constant $c$ (roughly half of that of GNFS).