We present high-performance techniques for
FFT-based polynomial multiplication on multi-cores, with a focus on
unbalanced input data. We show that balanced data can maximize parallel
speed-up and minimize cache complexity
for bivariate multiplication. Then, we show how multivariate (and
univariate) multiplication can be efficiently reduced to balanced
bivariate multiplication.
This approach brings in practice dramatic improvements for unbalanced input data, using the Cilk++} concurrency
platform, even if the implementation relies on serial one-dimensional FFTs. This latter property allows us
to take advantage efficient non-standard and memory efficient FFT techniques, such as Truncated Fourier Transform,
which are hard to parallelize. |