The multiplication of univariate polynomials and
multi-precision integers is one of the most basic, well-studied, and
widely-used operations in mathematical computing. While asymptotic time
complexity correlates with performance, it is well-known that memory
access and cache misses also contribute significantly to the running
time of computer programs. Unfortunately, all algorithms which achieve
sub-quadratic time complexity for multiplication currently require at
least O(n) auxiliary storage space for their implementation.
New algorithms are presented which achieve the same time complexity as
Karatsuba's and FFT-based algorithms without requiring the use of an
auxiliary array for temporary storage. We will discuss some
optimizations and other issues encountered in writing a low-level C
language implementation of these algorithms with the goal of high
performance. The memory usage and cache performance will be compared to
other existing software, and the effects on the actual running time will
be examined. Finally, we will mention some future application areas and
further extensions that could be beneficial. |