George says his parallelized version of the FFT is written in three lines of Divacon code, whereas when I wrote my FFT programs in grad school, and again later in integer-only code for Sprex, each took me several pages. So I am taking this as a challenge: can I follow in George's footsteps?
Let's see if we can invent a PDC-FFT together. First the discrete Fourier Transform (O(n^2)), then the Fast FT (O(n*log(n))), and finally a Parallel FT (O(log(n))).
For \(b \in 0..(N/2)-1 \), let
\(DFT(b,N,X) = \large\sum_{s=0}^{N-1} X_s * e^{isTb/N}\)

Let's take this apart:
First, this is just truly weird. How you get from an exponential, \(e\) raised to an imaginary power, to a circle in the imaginary plane, I don't know. Still, you can prove it, like this:
First, let \(f(\theta) = \large \frac{cos(\theta)+i*sin(\theta)}{e^{i\theta}}.\)
Second, show that the derivative is zero, \(f'(\theta)=0\), meaning that \(f(\theta)\) is flat, unchanging, constant.
Third, show the value at zero is one: \(f(0)=1\). That specifies the constant value which \(f\) has everywhere.
So the ratio \(f(\theta)\) is always \(1\), irrespective of \(\theta\); the top and bottom parts of \(f\) must therefore have the same value as each other everywhere, so \(cos(\theta)+i*sin(\theta) = e^{i\theta}.\)
That's just a complex number with real part \(cos(\theta)\) and imaginary part \(sin(\theta)\), and if you draw it you'll see cos and sin are the coordinates of points on a circle, in fact on a unit circle centered on the origin. Therefore \(e^{i\theta}\) is a point on the unit circle in the imaginary plane. Now when \(\theta\) exceeds \(T = 2*\pi\) it has rotated around past its original starting point of \((1+i0)\).
Oh, sorry, this will also blow your mind. Notice: suppose you go some angle \(\theta\) around the imaginary unit circle, call it \(e^{i\theta}\). Then suppose you want to double that angle. Well, in the \(cos+i\,sin\) interpretation you just use \(2\theta\) and get \(cos(2\theta) + i\ sin(2\theta)\), but that's the same as \(e^{i(2\theta)}=(e^{i\theta})^2\) by Euler's Formula. So stepping around the circle means taking powers of the step \(e^{i\theta}\). This is very weird and very amazing, and you have to check it to see if you believe it's true, because otherwise you wouldn't believe it.
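If you do want to check it numerically before believing it, here is a two-assert sketch (mine, not from the text; it only assumes Python's standard cmath and math modules):

```python
import cmath, math

theta = 0.7                               # any angle
step = cmath.exp(1j * theta)              # Euler: should equal cos(theta) + i*sin(theta)
assert abs(step - complex(math.cos(theta), math.sin(theta))) < 1e-12
# doubling the angle is the same as squaring the step
assert abs(cmath.exp(1j * 2 * theta) - step**2) < 1e-12
```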
Done? Satisfied? No? Go back and think about it, this should be in your toolbelt. We will be going back and forth between powers of \(e^{i\theta}\) and increments of angles around the unit circle, all the time, starting now.
Now we pick one value of \(b\), and that value will determine both the step size and the number of bends which will be rotated through after \(N\) samples, which could be \(0, 1, 2\) or any integer up to \(\frac{N}{2}-1\).
Next we align our samples \(X_s\) with the positions of the hour hand on a counterclockwise-rotating clock zeroed at the 3 o'clock position. The "clock hand" \(e^{i\theta}\) is a unit vector based at the origin that rotates around the unit circle by a fixed angular step size, one step for each sample. Euler showed us that, above.
We will go back and forth between an angle, \(\theta\) and a point on a unit circle, \(e^{i\theta}\), each encodes the other very nicely, so I might say one when I mean the other. You can figure out what I mean, I think.
To go around the unit circle one bend over the course of all the samples, we use a step \(\large e^{i\theta_{step}}\) which rotates 1/N of a circle per sample. That is, \(\theta_{step} = T/N\), so the s'th sample reaches angle \(Ts/N\), and by the end of all N samples (counted 0 to N-1) we are all the way back to the beginning, or actually one step short of it, because the loop is exhaustive but not redundant: we already touched the 0'th step when we started at 0.
To summarize so far, this sum gives us a complex (real + imaginary) number, \(x_b+iy_b\), called the Fourier transform of X for a certain bend count \(b\), which is to say a logical frequency component. That point in polar coordinates tells us the magnitude \(r_b\) and angle-shift ("phase") \(\theta_b\) of that summed frequency component in the sampled signal values in our data array. As a point in the imaginary plane, it is typically not on the unit circle. But its angle from \(\theta=0\) at position \((1,0)\) to \(\theta_b\) is what we will call the phase shift of the sinusoid component that the formula has detected, and the length of the vector from the origin to that point, \(r_b\), is the magnitude of that sinusoid component.
So one thing is for you to understand how to calculate it, and by now it should be feeling friendly to you, and the other thing is, it's also good to develop some intuitions of what the results mean.
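To pin down the "how to calculate it" half, here is a minimal direct O(N^2) sketch of the formula in Python, including the polar read-out of magnitude \(r_b\) and phase \(\theta_b\). The names (`dft`, `polar`, `T`) are mine, not from the text; it is a sketch of the definition above, not an optimized transform.

```python
import cmath

T = 2 * cmath.pi   # a full turn, written T in the text

def dft(X):
    """Direct O(N^2) transform: DFT(b, N, X) = sum_{s=0}^{N-1} X[s] * e^{i*s*T*b/N},
    for bend counts b = 0 .. N/2 - 1."""
    N = len(X)
    return [sum(X[s] * cmath.exp(1j * s * T * b / N) for s in range(N))
            for b in range(N // 2)]

def polar(F_b):
    """Magnitude r_b and phase theta_b of one summed frequency component."""
    return abs(F_b), cmath.phase(F_b)
```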
So a given output of the formula will be based on a particular value of \(b\), and to think geometrically, it is created by wrapping the samples around the unit circle at a certain pace, faster in order to finish more bends by the end, whatever \(b\) is, and then given that alignment of samples to angles (or you could say scaling of angles by samples), adding them up.
If at 0 bends this sum is large then the signal has a constant bias, a.k.a. Direct-Current or DC offset. Oftentimes signal processing algorithms will subtract the DC component from all the samples considering it an artifact of the voltage settings of the machinery and uninformative of the sound wave or other tracked signal, which if it were interesting would be changing.
If at 1 bend the sum is large, then the signal contains an ultra-bass, a lowest-measurable-frequency signal component. Why? If the samples, now pointing in all directions, add up to zero, then they just cancel each other and yield nothing for that bend count, neither in total nor on average. There is nothing there, at that \(b\)-specific pace.
But if they don't cancel out, if in some general direction they add up the way positive numbers do, only here they are vectors pointing in that general direction, then the samples pointing the opposite way enter the sum as negatives. Think about those negatives: they are negatives in the anti-positive direction, a double negative, which means they reinforce the positive-direction parts of the sum. If the total thus accumulated is far from zero, it reveals a substantial signal component at the particular bend count you were using to step around the circle.
That's the idea of wave analysis, waves have peaks and troughs, and when you wrap them around the circle the troughs being negatives on the other side of the circle they add to the peaks, but only at that bend count at which they line up together. The correlation of both peaks and valleys gives a nice big sum when they are lined up at the right pace around the unit circle. And that means the signal has a significant sinusoidal signal component at that frequency. Cool and amazing, right? I think so.
Finally if you go \((N/2) -1\) bends around the unit circle for each sample step, aligning samples X[0] to X[N-1] to their corresponding s'th-step unit vector and scaling them and adding them all together, that is a way to detect if there is a very high frequency component in your signal. If every even sample you go positive and every odd sample you go negative, or vice versa, at least enough to affect the total sum of them all, then you will end up with a large, high frequency, signal component. Curiously, if you went beyond N/2 - 1 bends to even higher frequencies, the system stops making sense because the alignments start to wrap around and overlap with previous frequencies and you don't get new information. (The similar Nyquist frequency concept says you need two points for every sine wave's peak and trough, here we have half as many frequencies as points. So we will stop at \(b \le \large \frac{N}{2} - 1\).)
I can't stop saying how this frequency analysis thing is super cool and amazing:
Did I mention, Fourier's (1807) Transform seems to have been predated by Gauss' (est. 1805) unpublished work, which tripled Fourier's achievement, including not only (1) the idea of this samples-in-time to amplitudes-in-frequency transformation but also (2) the discrete version of it and (3) an efficient method for calculating it, equivalent to what is today called decimation-in-frequency FFT on real samples. Poor Gauss: he of course invented plenty of stuff, but this particular genius idea of his was forgotten and reinvented and finally only made popular by Cooley and Tukey (1965) when it was needed for detecting underground nuclear testing in the Cold War, so I've read. But it's way, Way, WAY worse: at least 11 other authors found ways of doing the same basic thing, whether restricted to sample counts that are powers of 2, or prime-factorable, etc., including the we-only-ignored-you-for-23-years Danielson and Lanczos (1942), whose lemma is used to derive the following (after Press et al., Numerical Recipes):
First, we need to know that Mou's beautiful even-odd vector division notation, \(d_{eo} X\), can be clumsily written in traditional math form, dividing the vector X into evens and odds by letting \(s = 0..\frac{N}{2}-1\), \(even = 2s\), \(odd = 2s+1\).
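In ordinary code the even-odd division is just two slices. A one-function Python sketch of \(d_{eo}\) (my rendering, not Mou's):

```python
def d_eo(X):
    """Even-odd division: X -> (X[0], X[2], ...), (X[1], X[3], ...).
    After the split, each half is re-indexed from 0, as Divacon assumes."""
    return X[0::2], X[1::2]
```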
Second, let "step" \(S = S_b = e^{iTb/N}\); so \(S^n\) rotates n steps and \(S^N\) rotates \(b\) bends. (If N is very very large like millions or billions, then maybe \(S_1\) is almost indistinguishable from 1, yielding numerical instability in calculating its powers. In that case, keep a bigger S around with its built in exponent, say 10^5 or whatever, for accurate calculations of \((e^{iT/N})^{sb}\).)
Then follow this derivation:
\(F_{b,N} X = \large\sum_{s=0}^{N-1} X_s * S^s \)
\(\ \ \ \ \ \ \ \ \ = \large\sum_{s=0}^{\frac{N}{2}-1} X_{2s} * S^{2s}\ +\ \large\sum_{s=0}^{\frac{N}{2}-1} X_{2s+1} * S^{2s+1}\)
\(\ \ \ \ \ \ \ \ \ = \large\sum_{s=0}^{\frac{N}{2}-1} X_{2s} * S^{2s}\ +\ S * \large\sum_{s=0}^{\frac{N}{2}-1} X_{2s+1} * S^{2s}\)
\(\ \ \ \ \ \ \ \ \ = \large F_{b,\frac{N}{2}} X_{even}\ +\ S * F_{b,\frac{N}{2}} X_{odd}\)
But if you prefer using \(e^{i\theta}\) notation to \(S\), you can follow through this derivation of the same thing:
\(F_{b,N} X = \large\sum_{s=0}^{N-1} X_s * e^{isTb/N} \)
\(\ \ \ \ \ \ \ \ \ = \large\sum_{s=0}^{\frac{N}{2}-1} X_{2s} * e^{i(2s)Tb/N}\ +\ \large\sum_{s=0}^{\frac{N}{2}-1} X_{2s+1} * e^{i(2s+1)Tb/N}\)
\(\ \ \ \ \ \ \ \ \ = F_{b,\frac{N}{2}} X_{even}\ +\ \large\sum_{s=0}^{\frac{N}{2}-1} X_{2s+1} * e^{iTb/N} * e^{isTb/(N/2)}\)
\(\ \ \ \ \ \ \ \ \ = F_{b,\frac{N}{2}} X_{even}\ +\ e^{iTb/N} * F_{b,\frac{N}{2}} X_{odd}\)
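That last line is the whole Danielson-Lanczos trick, and it translates almost word for word into a recursive O(N log N) FFT. Here is a hedged Python sketch (my code, assuming N is a power of two and the same \(e^{+isTb/N}\) sign convention as above; it returns all N bins, of which the text keeps only the first N/2):

```python
import cmath

T = 2 * cmath.pi

def fft(X):
    """Recursive FFT built from the split derived above:
    F_{b,N} X = F_{b,N/2} X_even + e^{iTb/N} * F_{b,N/2} X_odd."""
    N = len(X)
    if N == 1:                          # a 1-point transform is the sample itself
        return list(X)
    evens = fft(X[0::2])                # F_{b,N/2} X_even for b = 0..N/2-1
    odds = fft(X[1::2])                 # F_{b,N/2} X_odd  for b = 0..N/2-1
    out = [0j] * N
    for b in range(N // 2):
        t = cmath.exp(1j * T * b / N) * odds[b]
        out[b] = evens[b] + t           # the combination from the derivation
        out[b + N // 2] = evens[b] - t  # e^{iT(b+N/2)/N} = -e^{iTb/N}, so the upper bins reuse the same halves
    return out
```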
\(PFT_b = PDC(divide=d_{eo},\ combine=id,\ preadjust=(self,b=2b),\ postadjust=(other*e^{iTb/N}\ +\ self),\ basepred=atom?,\ basefunc=id)\)

Meaning:
This is the essence of parallel decomposition. More than recursive decomposition where a function's definition includes a reference to itself, in parallel decomposition a function's definition includes a distribution of itself. \(!\ F\) says separately apply \(F\) to each of the tuples in the following array. Since they are done separately, they can be done in parallel. Hence the power of parallel, more than mere recursive, analysis.
\(!\) is the whole trick, and everything else revolves around enabling our formalism to use it: dividing arrays to be distributed over; implicitly normalizing their indexes so that after any number of divisions each piece counts from 0 to length-1 and F can apply to the divided arrays just as to the original; and communicating data from other parts of the array so that the local computation really is local. The design problem in creating a parallel algorithm with Divacon is to figure out the minimal nub of local computation and to place it properly in the expanding and collapsing of the computation tree; then to move data around so it's ready for that nub of local computation to do its work at that point. By then you should know whether the divisions should be even/odd or first-half/second-half (LR). It is a 3D dance of perspectives on your problem, and until you get the idea of this approach it seems impossible; but once you get it, any algorithm's coding is almost trivial. George said he wrote his dissertation's benchmarked reference algorithms in less than a half hour each, and each attained optimal or "sub-optimal" speeds. I think George's perfect English finally failed him there: by "sub-optimal" I think he meant less than the so-far-best times, i.e. faster than the best results so far, not slower, or at least he meant "near-optimal". I hope readers didn't read that and misunderstand.
This means the computation graph gets built all the way down to the leaves of the tree by first dividing, as part of \(PFT\), then dividing again as part of recursive calls to \(PFT\), and no other kinds of operations on the way down, because preadjust is almost a no-op. The work, the actual calculations, will be done slightly at the bottom, and almost entirely on the way back up the tree.
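To make the shape of that expression concrete, here is a loose Python sketch of a PDC-style combinator and the \(PFT_b\) instance. This is my reading of the notation, not Mou's implementation: `pdc`, `env`, and the lambdas are my names, the recursion here is sequential, and the list comprehension marks where the \(!\) distribution would fan out across processors.

```python
import cmath

T = 2 * cmath.pi

def pdc(divide, combine, preadjust, postadjust, basepred, basefunc):
    """Skeleton of a parallel divide-and-conquer combinator: basefunc at atoms,
    otherwise divide, recurse over the parts (conceptually in parallel),
    then postadjust and combine on the way back up."""
    def f(x, env):
        if basepred(x):
            return basefunc(x, env)
        parts = divide(x)
        sub_envs = preadjust(env, len(parts))
        results = [f(p, e) for p, e in zip(parts, sub_envs)]  # the '!' step: independent sub-applications
        return combine(postadjust(results, env))
    return f

def pft(b, N, X):
    """PFT_b as I read the Divacon expression: divide = d_eo, combine = id,
    preadjust doubles b for the halves (their step relative to the original N),
    postadjust forms self + e^{iTb/N} * other, and the base case is the atom itself."""
    f = pdc(
        divide=lambda x: (x[0::2], x[1::2]),                  # d_eo
        combine=lambda r: r,                                  # id
        preadjust=lambda env, k: [(2 * env[0], env[1])] * k,  # (b, N) -> (2b, N), i.e. b = 2b
        postadjust=lambda rs, env: rs[0] + cmath.exp(1j * T * env[0] / env[1]) * rs[1],
        basepred=lambda x: len(x) == 1,                       # atom?
        basefunc=lambda x, env: x[0],                         # id
    )
    return f(X, (b, N))
```

As a sanity check, `pft(b, len(X), X)` agrees with the direct sum \(\sum_s X_s e^{isTb/N}\) for power-of-two lengths, and the recursion depth, hence the parallel time per bin, is \(log_2(N)\).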
"I have mentioned that procedure identifiers are initialized through declaration. Then the attachment of procedure to identifier can be changed by assignment. I have already mentioned how this can be done by means of [function] pointers. There are, of course, other ways. The simplest is not to change the identifier at all, but rather to have a selection index that picks a procedure out of a set. The initialization now defines an array of forms, e.g., procedure array P [1:j] (f1, f2,... ,fj); ... The call P [i] (al, a2, . .., aj) would select the ith procedure body for execution. Or one could define a procedure switch P:= A, B, C and procedure designational expressions so that the above call would select the ith procedure designational expression for execution. The above approaches are too static for some applications and they lack an important property of assignment: the ability to determine when an assigned form is no longer accessible, so that its storage may be otherwise used. A possible application for such procedures, i.e., ones that are dynamically assigned, is as generators. Suppose we have a procedure computing (a) \(large \sum_{k=0}{N} C_k(N) X^k\) as an approximation to some function (b) f(x)= \(large \sum_{k=0}{\inf} C_k X^k\), when the integer N is specified. Now once having found the \(C_k(N)\), we are merely interested in evaluating (a) for different values of \(x\). We might then wish to define a procedure which prepares (a) from (b). This procedure, on its initial execution, assigns, either to itself or to some other identifier, the procedure which computes (a). Subsequent calls on that identifier will only yield this created computation." (p 5-6)Perlis' suggestion seems very close to the incremented-base-step adjustment of the functions \(F_b\) of a multi-frequency optimized parallel Fourier Transform. The parallel breakdown of the function creation seems obvious: a log(N) decomposition of the calculation of base steps.
Let's try generating all base steps \(S_b = e^{iTb/N}\) in \(log(N/2)\) steps:
\(S_1 = e^{iT/N}\)
\(S[] = \) doubleshift \(Ss\) // calculate the base steps Ss in log N time
where doubleshift \(Ss = \) atom\(?\ \ (\ 1\ S\ )\) // base case
else \(\ (\ Ss\ \ !(\ S_1\ *\ ((last\ 0)\ Ss))\ Ss\ )\) // retain all Ss on the left half and increment the copies on the right
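Here is how I read that doubleshift in Python (my names, a sketch only; the point is that each round is one parallel map, so the whole table of base steps exists after \(log_2(N/2)\) rounds):

```python
import cmath

T = 2 * cmath.pi

def base_steps(N):
    """Table of base steps S_b = e^{iTb/N} for b = 0 .. N/2 - 1, built by doubling:
    keep the current list on the left, append a copy shifted by one more power
    of S_1 on the right. log2(N/2) rounds for a power-of-two N."""
    S1 = cmath.exp(1j * T / N)
    Ss = [1 + 0j]                          # S_0 = 1
    while len(Ss) < N // 2:
        shift = S1 * Ss[-1]                # S_1 * (last Ss): the next power of S_1
        Ss = Ss + [shift * s for s in Ss]  # right half is a (parallelizable) map over the left half
    return Ss
```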
\(PDCFT[b] X = PDC(d_{eo},c_{eo},id,ma(Ss[b]),atom?,id) X\ \ \ \ \ \ \ \) // calculate \(F_{S_b}\) in \(log(N)\) time, in parallel for each \(b\).
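Putting the pieces together, here is a hedged end-to-end check against the direct O(N^2) `dft` sketched earlier (sequential here; the per-b loop is where the parallelism would live, and in the author's version the precomputed Ss[b] feeds each postadjust, whereas my simplified `pft` recomputes its step inline):

```python
X = [0.0, 1.0, 0.0, -1.0, 0.5, 1.0, -0.5, -1.0]     # any power-of-two-length sample frame
N = len(X)
Ss = base_steps(N)                                   # log-time table of S_b (see above)
bins = [pft(b, N, X) for b in range(N // 2)]         # independent per b: the parallel loop
assert all(abs(p - d) < 1e-9 for p, d in zip(bins, dft(X)))
```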
So you could choose to accept my claim \(PFT\ \sim\ O(log(N))\) because \(S_b\)'s are precalculated, or you could choose to believe it because in the curious math of computational complexity, \(O(2*log(N)) = O(log(N))\), as they say, "within a constant factor". In either case I was shocked to discover I had reduced a 1024 point DFT (to take a typical audio analysis frame size) from the well-known FFT time of \(O(N * log(N)) = O(1024*10)\) to this amazing PFT time of \(O(log(N)) = O(10)\). A 1000X speedup is shocking. In fact I didn't believe it, before I forced myself to imagine 1000 processors working on it together. Edible piranhas indeed. Typical of George, he said laconically of this Divacon expression, "Looks right to me", and of the speed result, "Of course, order log N in parallel". He knew it already and was not the least surprised, yet my Google searches only find NVidia itself, and some Israeli computer scientists in 2023, as saying that this is even an imaginable goal. So I'm confused if I deserve a Satisfactory for my homework, or a Turing Award for out-Gaussing Gauss himself. Does anyone else find this amusing?
A few more speedups might be worth mentioning. If we do calculate the \(S_b\)'s, then each application of PFT[b] may be started as soon as its corresponding \(S_b\) is available, so only the last PFT[b] has to wait for the final base step; a staggered early start might save some parallel-computation nodes' idle time while the last steps come in. Also, if you only care about getting the initial (lower-frequency) results from the analysis, you could stop at any specified discard threshold and save a fraction of the time.
To reiterate, if fully parallelized, then the N/2 functions \(PFT_b\) are, first, themselves generated in log(N) time if not precalculated entirely, and second, applied to the sampled data vector X in log(N) time. Hence the entire vector of frequencies F is calculated in O(2*log(N)) = O(log(N)) time. There are a lot of multiply-adds going on, but they are done in parallel in a small number, log(N), of levels, and each level takes O(1) time.
From what little I know, I believe this Parallel Divacon Fourier Transform algorithm is the fastest Fast Fourier Transform algorithm. My academic friends gently correct me, saying no, FFT is O(N*log(N)); but that is the serial result, not the parallel one. I found a 2023 ArXiv reference that also claims O(log(N)) speed for an FFT, but that is a pretty complex paper, and was my 12-symbol PFT homework really that fancy? No. That reference in turn refers to NVidia's efforts to beat O(N*log(N)) with its GPUs. Apparently O(log(N)) is actually newsworthy. I'll take it.
On the other hand, this analysis ignores any cost for plenty of processors, and assumes zero time cost for communications, which actually becomes a dominant factor in highly parallel computing.
H. BURKHARDT,"Trigonometrische Interpolation," in Chapter 9, Encyklopiidie der Mathematischen Wissenschaften, vol. 2, part 1, 1st half, 1899-1916.
H. BURKHARDT,"Trigonometrische Reihenund Integrale(bis etwa 1850)," in Chap- ter 12, Encyklopiidie der Mathematischen Wissenschaften,vol. 2, part 1, 2nd half, 1904-1916.
J. W. COOLEY & J. W. TUKEY, "An Algorithm for the Machine Calculation of Complex Fourier Series," Math. Comput., Vol. 19, no. 2, pp. 297-301, Reprinted in Digital SignalProcessing,ed.L. R.RABINER& C. M. RADER,pp.223-227,New York: IEEE Press,1972,Apr. 1965.
C. F. Gauss, "Nachlass, Theoria InterpolationisMethodo Nova Tractata," in Carl Friedrich Gauss Werke, Band 3, Koniglichen Gesellschaft der Wissenschaften: Gottingen, pp. 265-330, 1866.
MICHAEL T. HEIDEMAN, DON H. JOHNSON, C. SIDNEY BURRUS "Gauss and the History of the Fast Fourier Transform", 1985. Rice U Eng. Dept.
Orian Leitersdorf, Yahav Boneh, Gonen Gazit, Ronny Ronen, Shahar Kvatinsky, FourierPIM: High-Throughput In-Memory Fast Fourier Transform and Polynomial Multiplication Viterbi Faculty of Electrical and Computer Engineering, Technion – Israel Institute of Technology, Haifa, 3200003, Israel. ArXiv 2304.02336v1 [cs.AR] 5 Apr 2023. Alan Perlis, The Synthesis of Algorithmic Systems, The 1st Turing Lecture. (ACM V 14 No 1, 1967).
Zhijing George Mou, A Formal Model for Divide-and-Conquer and Its Parallel Realization, May 1990, Research Report YALEU/DCS/RR-795.