Divide and Conquer FFT

A Homework Exercise
in which the recursive decomposition of the DFT (\(O(N^2)\)), namely the FFT (\(O(N \log N)\)),
is made parallel, achieving time \(O(\log N)\)

On Mou
on Divide and Conquer

TL;DR

I wrote this to document my own education in Divacon (Mou, 1990). May it help you too.

Parallel Fourier Transform, defined and explained below, computes the DFT in O(log(N)) time in parallel, disregarding communications. In the case of \(N=1024\), its time cost is O(10), as compared with the sequential (Gauss, Cooley-Tukey) "Fast" FT time cost of O(10,240) or the naive DFT time cost of O(1,048,576).

Evidently Divacon parallelization is powerful, if we can just learn its simple elements and practice its simple way of thinking.

A Language For Parallelism

This conceptual approach, language, and notation are from George Mou's 1990 Yale Computer Science PhD thesis. Alan Perlis said to George Mou, "This is the golden key to future computer science." To begin, we will need some math-like notation and terms.

So a vector might not be normalized, then become normalized, then be operated upon while normalized, which makes the algorithm simple, then be inverse-normalized back to its original indexing. Separating the calculations from the indexing struggles makes things nice and simple. George likes to distinguish between communications versus operations. Getting the data you need, and doing something with it, should be separate concerns. Indeed, each has nothing intrinsically to do with the other. So we do relative indexing so as to forget about where the data came from and what indexes it may have had in that location, so that our operations can be simple and pure.
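As a tiny illustration of normalized, relative indexing, here is a sketch in plain Python rather than Divacon notation; the function name divide_lr is mine:

# After a divide, each half is re-indexed from 0, so the local operation
# never needs to know where its data sat in the parent array.
def divide_lr(xs):
    """Left-right divide; both halves come back normalized (indexed from 0)."""
    mid = len(xs) // 2
    return xs[:mid], xs[mid:]   # Python slices re-index automatically

left, right = divide_lr([10, 11, 12, 13, 14, 15, 16, 17])
print(left)    # [10, 11, 12, 13]  -> local indices 0..3
print(right)   # [14, 15, 16, 17]  -> local indices 0..3, not 4..7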

Parallel Forms

Next, some combining forms, and some math that won't hurt us: the ones we use (\(:\) for composition and \(!\) for distribution) are spelled out in the Breakdown and Explanation section below.

Finally, the simplest of all things, \(id\), the function whose output is identical to its input. Was that simple enough? Maybe you're stuck on morphisms, but with \(!\) and \(:\) can you see how \(f\ =\ c\ :\ !\ f\ :\ d\) would be a parallel decomposition? That's all you really need.

Decomposition: How to Think Clearly About Computing

Writing a Divacon program is checking 6 boxes on a checklist. All you have to do is define the divide and combine functions which manipulate structures, the communication function which specifies what other data is needed to do a computation, and the local computation which can be carried out once that data is present. Each is easy to define separately, and each is used systematically in the Divacon structure, so the work goes so easily and quickly that George tells me: Tom, if it takes longer than one minute, then your answer is wrong!

Divacon defines a higher-order-function generator, PDC(), which accepts six functions, and returns a parallel program implementing the algorithm embodied in those functions. Define and apply a PDC as follows:

\(pdc\ =\ PDC(\)divide, combine, preadjust, postadjust, basepred, basefunc\()\)
\(Y\ =\ pdc\ X\)

A PDC so defined is equivalent to the following expression, which constitutes a general if not exhaustive theory of parallel algorithm decomposition.

Divacon:
\(pdc = (\)basepred ? \(then\) basefunc
\(else\) combine : postadjust : \(!\ pdc\ \) : preadjust : divide )

Study this until it is intuitive.
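To make the schema concrete, here is a minimal sequential Python sketch of PDC() as I read it; it is a simulation only (the list comprehension stands in for the parallel "! pdc" step), and the toy psum example at the end is mine, not Mou's:

def PDC(divide, combine, preadjust, postadjust, basepred, basefunc):
    def pdc(x):
        if basepred(x):
            return basefunc(x)
        parts = divide(x)                 # structure manipulation
        parts = preadjust(parts)          # communication / local op, going down
        parts = [pdc(p) for p in parts]   # "! pdc": recurse on every part
        parts = postadjust(parts)         # communication / local op, coming up
        return combine(parts)             # structure manipulation
    return pdc

# Toy example: a parallel-shaped sum of an array.
psum = PDC(divide=lambda xs: (xs[:len(xs)//2], xs[len(xs)//2:]),
           combine=lambda parts: parts[0] + parts[1],
           preadjust=lambda parts: parts,
           postadjust=lambda parts: parts,
           basepred=lambda xs: len(xs) == 1,
           basefunc=lambda xs: xs[0])
print(psum([1, 2, 3, 4, 5, 6, 7, 8]))     # 36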

Each of the six functions is mostly trivial, or else a single selection from a small set of canonical functions. Their combination is to be understood as follows.

The ":" composition operator operates right to left: \((f:g)(x) = f(g(x))\). So the first function in the above chain is the divide function.

Having divided a bunch of stuff typically in half, we then apply to each half the next function in the chain: preadjust. Preadjust is going to be a "communication function" which specifies, and then brings in, what other data is needed to do this computation.

The local computation which does the work needs all its relevant data to be local, so we must capture everything it needs and bring it in. In particular there will be a tuple containing everything the local operation requires. Further, there will be an array of such tuples, because a parallel process means the same operation running identically in a bunch of parallel processors, and each of those processors must be handed the same kind of data, namely a tuple of all the data elements it requires. So we build an array of those tuples, send one tuple to each processor, and then each processor can do its work separately from all the others, which means it can be done in parallel, which is the point.

Whereas if you need to stop and think, in the midst of each parallel process, about what that process needs, and then copy it over from wherever it happens to be, well, now you're talking spaghetti code, and we don't tolerate such poor hygiene here.

Although this was also discussed here, the following may be helpful. divide and combine are structure manipulation functions: they take an array, the input data, and restructure it. \(d_{eo}\) divides a single input array into two, taking evens to the left and odds to the right; \(c_{eo}\) combines two input arrays into one, shuffling them into even and odd positions in the output. This has nothing to do with going elsewhere to get any data (communication), or doing anything to the data (local calculation): it is orthogonal. But once you've specified your divide and combine functions, you have a whole structure up and down the parallel computation graph, which makes life simpler for the other tasks. Similarly, communication and local calculation are orthogonally specifiable, and better so. Communication itself comes in two forms: specifying what to get, and getting it.

But the more basic idea is to think about this in the context of parallel programming. In a parallel program there is some data which every processor will be separately processing, because to be parallel we want them working independently. So we would like that data to be entirely local to each process, so that it can do its local calculations without interrupting itself to go fetch data from some other place. This means all the data a process needs must be provided in a data structure fed to it, separately. This means we want an array of tuples, one for each subprocess, each containing all the data required by any instance of the subprocess. Then each one can proceed on its own, which is what makes it a parallel program.

Secondly, consider the interrelationships among the data of a parallel process. It is always and necessarily a bunch of data, and we can always go through that bunch and make an array of it all. Having made an array of it, and by assumption it is all the data our parallel process will be using, there may still be dependencies among its elements: for processing this part of the array, you may need some data from a corresponding other part of the array. That's the idea here.

Let's put it all together:

\(pdc\ =\ \)combine : postadjust : \(!\ pdc\) : preadjust : divide

Reading backwards (that is, in compositional order from right to left), the divide function takes the input, which is an array of data, and divides it, typically in half (but dividing into thirds or \(1/N\)'s is okay too). Now it is tuple-building time, so that each process can run separately on its collected required data elements, stuffed into one tuple per process. The preadjust function builds the tuple, first by specifying which corresponding element (or more) needs to be brought in here for the local computation, that is, by an index generating function; second, by copying the indexed other data from the other sub-array into this element's tuple, by a communication function. (A small sketch of \(d_{eo}\) and \(c_{eo}\) follows below.)
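Here is that even-odd structure manipulation as a small Python sketch (plain Python lists, not Divacon arrays; only the names d_eo and c_eo come from the text):

def d_eo(xs):
    """Divide one array into (evens, odds) by position."""
    return xs[0::2], xs[1::2]

def c_eo(pair):
    """Combine (evens, odds) back into one array, shuffled into even/odd positions."""
    evens, odds = pair
    out = [None] * (len(evens) + len(odds))
    out[0::2] = evens
    out[1::2] = odds
    return out

xs = [0, 1, 2, 3, 4, 5, 6, 7]
print(d_eo(xs))          # ([0, 2, 4, 6], [1, 3, 5, 7])
print(c_eo(d_eo(xs)))    # [0, 1, 2, 3, 4, 5, 6, 7]   i.e.  c_eo : d_eo = id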

The index generating function, then, takes the normalized index within our subarray (the index of "this" tuple) and calculates the index or indexes of the desired other data, at a corresponding place in the other subarray. For example, the \(i\)th elements in both subarrays both have index \(i\), so it is not complicated to figure out from one what the other is. (That's called corr, for "correspondent" communication.) Another, mirr for "mirror", pairs up the \(i\)th element in this sub-array with the mirror-image element (counting from the end) in the other, and vice versa. In both cases the "other" is known for any array's \(i\)th element without asking anywhere else for outside data.

Then the communication function builds our tuple: it takes the local array element as the first element of the tuple, and fetches from the other array the element whose index the index generating function produced. So both right and left sub-arrays get manipulated into new structures; they become arrays of tuples.
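A hedged sketch of the index generation and communication step as I read it; the helper names corr, mirr, and communicate are mine, and Mou's definitions may differ in detail:

def corr(i, n):
    """Correspondent: the i-th element here pairs with the i-th element there."""
    return i

def mirr(i, n):
    """Mirror: the i-th element here pairs with the mirror-image position there."""
    return n - 1 - i

def communicate(index_fn, left, right):
    """Turn two sub-arrays into two arrays of (self, other) tuples."""
    n = len(left)
    new_left  = [(left[i],  right[index_fn(i, n)]) for i in range(n)]
    new_right = [(right[i], left[index_fn(i, n)])  for i in range(n)]
    return new_left, new_right

L, R = [0, 2, 4, 6], [1, 3, 5, 7]
print(communicate(corr, L, R)[0])   # [(0, 1), (2, 3), (4, 5), (6, 7)]
print(communicate(mirr, L, R)[0])   # [(0, 7), (2, 5), (4, 3), (6, 1)]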

Does it seem as if we could wait? No! We want all the communication to be handled at once, and separately from the local calculations. We create nice tidy separate workspaces with everything needed in a nice tuple of operating supplies, so then the local operation task can be clean and simple.

At this point we have specified divide, combine, and communicate. Divide and combine are functions given separately to PDC(), while communicate might be just a compositional part of the preadjust function (the first-applied part), if it is followed by a local computation function to be carried out in the building up of the computation graph.

The postadjust function might also include communication as well as local operation.

See, parallel processing builds up a computation graph which is a tree. At the top the tree is a single node that takes in the whole giant problem as one bite, but at the leaves of the tree the problem is usually too trivial to even do anything. You can often swallow without even chewing once!

You might think (I confuse myself by thinking) the tree does calculations at each node, thereby doing a recursive breakdown or buildup of results from smaller parts of the tree; but actually those calculations are not so much done at the node as during the building out of the tree, or during the summarizing, the winding back up, of the tree. The calculations (except at the very ends, the leaves of the tree) are always and only done on the way up or on the way down. Once again, there will be no separate traversing of the tree to do operations at each node outside the preadjust or postadjust processes, which themselves build out the branches from the nodes, or collapse the branches into the nodes, respectively.

Finally, how do we go about understanding and defining our local operation? Let's take those separately.

Understand the local operation by recognizing

That should help you.

Next, defining your local operation. This will be the nub of the idea of the parallel decomposition. If you stare at your equation, you'll find a repeating pattern which allows the problem to be broken down into multiple smaller instances of the same problem. The work of combining the results those instances return into the larger solution is the work of the local operation.

Actually, you really don't have to solve the problem, at all, just see that it contains a simpler or smaller version of itself.

In FFT it is: shift (multiply by a constant) the odds' sum and add it to the evens' sum. In Reverse it is: delete self, accept other (referring to the two elements of a communicated 2-tuple).
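In code, those two local operations might look like this (a sketch only; the (self, other) tuple layout and the names are my own):

def fft_local(pair, S):
    """FFT nub: shift (multiply by the constant S) the odds' result and add the evens'."""
    self_, other = pair
    return self_ + S * other

def reverse_local(pair):
    """Reverse nub: delete self, accept other."""
    self_, other = pair
    return other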

I forgot to mention, anything that is needed that is NOT part of the (variable-content) array can be specified as a constant in the local operation function itself. Everything else must come from the input array.

basepred and basefunc are discussed here. So that's everything: structure manipulation in the divide and combine functions; orthogonal to communication, which is the first-applied part of the preadjust or postadjust functions; orthogonal to local operation, which is the second-applied part of the preadjust or postadjust functions.

Examples

Parallel Fourier Transform

For \(b \in \{0, \dots, (N/2)-1\}\), let

\(DFT(b,N,X) = \large\sum_{s=0}^{N-1} X_s * e^{isTb/N}\)
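For concreteness, here is the formula evaluated directly for every \(b\) in Python, an O(N^2) baseline with \(T = 2\pi\) and the positive-exponent convention used here; the name dft_direct is mine:

import cmath

def dft_direct(X):
    N, T = len(X), 2 * cmath.pi
    return [sum(X[s] * cmath.exp(1j * T * s * b / N) for s in range(N))
            for b in range(N)]

print(dft_direct([1, 0, 0, 0]))   # a unit impulse transforms to all ones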

Definitions: Our nub is to separate even and odd terms, which can be calculated separately and added, and then to notice the odd sum can factor out our base step angle \(S_b = e^{iTb/N}\) leaving the identical evens' formula, or calculation process, but using the odds' data. Since the subproblems are half the size, the step angle doubles from \(S_b\) to \((S_b)^2\), to reach the same number of rotations after only half the steps.

\(F_{S_b,N} X\) \( = \large\sum_{s=0}^{N-1} X_s * (S_b)^s \)
\( = \large\sum_{s=0}^{\frac{N}{2}-1} X_{2s} * (S_b)^{2s}\)\(\ \ +\ \large\sum_{s=0}^{\frac{N}{2}-1} X_{(2s+1)} * (S_b)^{(2s+1)}\)
\( =\ \ \ \ \ "\ \ \)\(\ \ +\ \large\sum_{s=0}^{\frac{N}{2}-1} X_{(2s+1)} * S_b * (S_b)^{(2s)}\)
\( = \large F_{(S_b)^2,\frac{N}{2}} X_{evens} \)\(\ \ +\ \large S_b * F_{(S_b)^2,\frac{N}{2}} X_{odds}\)

Since \(F\) appears both inside and outside (the full transform is expressed in terms of two half-size transforms of the same form), this can be transcribed as a PDC parallelization of \(F_b\).
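A quick numerical check of the even/odd identity derived above (my own sketch; the helper F is just the generic sum used in the derivation):

import cmath, random

def F(S, X):
    """The generic sum used in the derivation: sum_s X[s] * S^s."""
    return sum(x * S**s for s, x in enumerate(X))

N, b = 8, 3
S = cmath.exp(1j * 2 * cmath.pi * b / N)     # S_b = e^{iTb/N}, T = 2*pi
X = [random.random() for _ in range(N)]

lhs = F(S, X)
rhs = F(S**2, X[0::2]) + S * F(S**2, X[1::2])
print(abs(lhs - rhs) < 1e-9)                 # True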

The following Divacon expression implements a Parallel Fourier Transform (PFT) algorithm:

\(F_{b,N}\ X\ = \) atom ?  \((C(id),\ C(id),\ b=0)\)
otherwise  \(c_{eo}\ :\ (\text{self}+S_b*\text{other},\ \text{self}-S_b*\text{other},\ 2b)\ :\ \#\ !\ corr\ :\ !\ F_{2b,\frac{N}{2}}\ :\ d_{eo}\ X \)
and in Divacon code:

pft = PDC(\( \text{divide}=d_{eo}, \text{combine}=c_{eo},\)
\(\text{preadjust}=id, \text{postadjust}=!(\text{self}+\text{other}*e^{iTb/N},\ \text{self}-\text{other}*e^{iTb/N},\ 2b) : \# !\ \text{corr}, \)
\(\text{basepred}=\text{atom}?,\text{ basefunc}=(C(id), C(id), b=0)\ )\)
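To see it run, here is a hedged Python sketch in the spirit of the expression above: even-odd divide, recurse on both halves (the "! F" step, here just two sequential calls), then the self ± S_b*other butterfly. It is a conventional recursive formulation rather than a literal transcription of the Divacon notation, and it uses the text's positive-exponent convention S_b = e^{iTb/N} with T = 2*pi:

import cmath, random

def pft(X):
    N = len(X)
    if N == 1:                        # base case: an atom is its own transform
        return list(X)
    E = pft(X[0::2])                  # transform of the evens }  in Divacon these two
    O = pft(X[1::2])                  # transform of the odds  }  calls run in parallel
    T = 2 * cmath.pi
    out = [0j] * N
    for b in range(N // 2):
        S_b = cmath.exp(1j * T * b / N)          # the step (twiddle) factor
        out[b]          = E[b] + S_b * O[b]      # self + S_b * other
        out[b + N // 2] = E[b] - S_b * O[b]      # self - S_b * other
    return out

# Check against the O(N^2) definition with the same sign convention.
def dft(X):
    N, T = len(X), 2 * cmath.pi
    return [sum(X[s] * cmath.exp(1j * T * s * b / N) for s in range(N))
            for b in range(N)]

X = [random.random() for _ in range(16)]
print(max(abs(p - d) for p, d in zip(pft(X), dft(X))) < 1e-9)   # True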

Breakdown and Explanation:

\(F_{b,N}\ X\ \ \ \ \) Fourier Transform of samples X[0..N-1] at a rate of \(b/N\) rotations per sample.
\((f\ :\ g)(x) = f(g(x))\)    \(:\) means function composition, \(f\) upon \(g\). First run \(g\), then run \(f\) on the output of \(g\). So \(d_{eo}\) is the first operation above.

\(c_{eo}\ :\ d_{eo}\ =\ d_{eo}^{-1} : d_{eo} = id\ \ \ \)    \(c,d\) mean combine and divide. \(eo\) means by evens and odds (think: deal out, shuffle together).
\(id\) is the function whose output is identical to its input.

\(!\ f\ A\)    \(!\) means function distribution. Apply \(f\) separately to each element of array A (i.e. in parallel). The result comes back as an array of all the results (usually a very short array of just two results).

\(self+S*other\)   

The local operation here is simply to multiply the odds result by a step factor \(S\) and add the evens result.

\(S=S_b=e^{iTb/N}\)   

The step factor is a unit-magnitude complex multiplier (a rotation) whose angle doubles at each deeper level of the division tree; equivalently, it halves at every level on the way up.
Tukey calls this the twiddle factor. Also \(T=2\pi\).

\(PDC\ FT\) in O(log(N))

Perlis has a hint in his Turing Award talk (ACM V 14 No 1, 1967), p 5-6:
"I have mentioned that procedure identifiers are initialized through declaration. Then the attachment of procedure to identifier can be changed by assignment. I have already mentioned how this can be done by means of [function] pointers. There are, of course, other ways. The simplest is not to change the identifier at all, but rather to have a selection index that picks a procedure out of a set. The initialization now defines an array of forms, e.g., procedure array P [1:j] (f1, f2,... ,fj); ... The call P [i] (al, a2, . .., aj) would select the ith procedure body for execution. Or one could define a procedure switch P:= A, B, C and procedure designational expressions so that the above call would select the ith procedure designational expression for execution. The above approaches are too static for some applications and they lack an important property of assignment: the ability to determine when an assigned form is no longer accessible, so that its storage may be otherwise used. A possible application for such procedures, i.e., ones that are dynamically assigned, is as generators. Suppose we have a procedure computing (a) \(large \sum_{k=0}{N} C_k(N) X^k\) as an approximation to some function (b) f(x)= \(large \sum_{k=0}{\inf} C_k X^k\), when the integer N is specified. Now once having found the \(C_k(N)\), we are merely interested in evaluating (a) for different values of \(x\). We might then wish to define a procedure which prepares (a) from (b). This procedure, on its initial execution, assigns, either to itself or to some other identifier, the procedure which computes (a). Subsequent calls on that identifier will only yield this created computation." (p 5-6)

Perlis' suggestion seems very close to the incremented-base-step adjustment of the functions \(F_b\) of a multi-frequency optimized parallel Fourier Transform. The parallel breakdown of the function creation seems obvious: a log(N) decomposition of the calculation of base steps.

Let's try generating all base steps \(S_b = e^{iTb/N}\) in \(log(N/2)\) steps (my reading of this in Python follows below):

\(S_1 = \)\(e^{iT/N}\)
\(S[] = \)doubleshift \(Ss \) // calculate base step Ss in log N time.
where doubleshift \( Ss = \)atom\(?\ \ \ (\ 1\ S\ )\) // base case
else \(\ \ \ \ \ \ (\ Ss\ !(\ S_1\ *\ ((last\ 0)\ Ss))\ Ss\ )\)// retain all Ss on the left half, increment copies on the right

and

\(PDCFT[b] X = PDC(d_{eo},c_{eo},id,ma(Ss[b]),atom?,id) X\ \ \ \ \ \ \ \) // calculate \(F_{S_b}\) in \(log(N)\) time, in parallel for each \(b\).
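Here is my reading of doubleshift as a Python sketch: build all the base steps \(S_b\), \(b = 0..N/2-1\), by doubling the list \(\log_2(N/2)\) times, each doubling appending a copy of the current list multiplied by a single constant (so each doubling is O(1) parallel time, given enough processors). The function name base_steps is mine:

import cmath

def base_steps(N):
    T = 2 * cmath.pi
    S1 = cmath.exp(1j * T / N)              # S_1
    Ss = [1 + 0j, S1]                       # base case: ( 1  S_1 )
    while len(Ss) < N // 2:
        shift = S1 * Ss[-1]                 # = S_1^(current length)
        Ss = Ss + [shift * s for s in Ss]   # keep the left half, increment the copies
    return Ss

Ss = base_steps(1024)
print(len(Ss))                              # 512 base steps
print(all(abs(Ss[b] - cmath.exp(2j * cmath.pi * b / 1024)) < 1e-9
          for b in range(512)))             # True: Ss[b] == e^{iTb/N}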

Conclusion

The previous section is nice, but its whole computation of \(S_b\) values is actually independent of the data so it could be precalculated once for a given N and stored for multiple later uses.

So you could choose to accept my claim that \(PFT\ \sim\ O(log(N))\) because the \(S_b\)'s are precalculated, or you could choose to believe it because, in the curious math of computational complexity, \(O(2*log(N)) = O(log(N))\), as they say, "within a constant factor". In either case I was shocked to discover I had reduced a 1024-point DFT (to take a typical audio analysis frame size) from the well-known FFT time of \(O(N * log(N)) = O(1024*10)\) to this amazing PFT time of \(O(log(N)) = O(10)\). A 1000X speedup is shocking. In fact I didn't believe it, until I forced myself to imagine 1000 processors working on it together. Edible piranhas indeed. Typical of George, he said laconically of this Divacon expression, "Looks right to me", and of the speed result, "Of course, order log N in parallel". He knew it already and was not the least surprised, yet my Google searches only find NVidia itself, and some Israeli computer scientists in 2023, even treating this as an imaginable goal. So I'm not sure whether I deserve a Satisfactory for my homework, or a Turing Award for out-Gaussing Gauss himself. Does anyone else find this amusing?

Later I discovered a reference to "algorithms using n processors and running in O(log n) time" for the FFT, on p. 233 of Selim Akl's 1989 book, The Design and Analysis of Parallel Algorithms.
A few more speedups might be worth mentioning. If we do calculate the \(S_b\)'s at run time, then the application of PFT[b] may be started as soon as its corresponding \(S_b\) is calculated; only the last PFT[b] has to wait for the final \(S_b\), so a staggered early start can save some parallel-computation nodes' time while the later \(S_b\)'s come in. Also, if you happen to care only about the initial (lower-frequency) results of the analysis, you could stop at any specified discard threshold and save a fraction of the time.

To reiterate: if fully parallelized, then the N functions \(PFT_b\) are, first, themselves generated in log(N) time (if not precalculated entirely), and second, applied to the sampled data vector X in log(N) time. Hence the entire vector of frequencies F is calculated in O(2*log(N)) = O(log(N)) time. There are a lot of multiply-adds going on, but they are done in parallel in a small number, log(N), of levels, and each level takes time O(1).

From what little I know, I believe this Parallel Divacon Fourier Transform algorithm is the fastest Fast Fourier Transform algorithm. My academic friends gently correct me, saying No, FFT is \(O(N*log(N))\); true, but that bound no longer applies where parallel. I found a 2023 ArXiv reference that also claims O(log(N)) speed for an FFT, but that is a pretty complex paper, and was my 12-symbol PFT homework really that fancy? No. That reference in turn refers to NVidia's efforts to beat O(N*log(N)) with its GPUs. Apparently O(log(N)) is actually newsworthy. I'll take it.

On the other hand, this analysis ignores the cost of providing plenty of processors, and assumes zero time cost for communications, which actually becomes a dominant factor in highly parallel computing.

Bibliography

John Backus, "Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs," the 1977 ACM Turing Award Lecture.

H. Burkhardt, "Trigonometrische Interpolation," Chapter 9 in Encyklopädie der Mathematischen Wissenschaften, vol. 2, part 1, 1st half, 1899-1916.

H. Burkhardt, "Trigonometrische Reihen und Integrale (bis etwa 1850)," Chapter 12 in Encyklopädie der Mathematischen Wissenschaften, vol. 2, part 1, 2nd half, 1904-1916.

J. W. Cooley & J. W. Tukey, "An Algorithm for the Machine Calculation of Complex Fourier Series," Math. Comput., vol. 19, no. 2, pp. 297-301, Apr. 1965. Reprinted in Digital Signal Processing, ed. L. R. Rabiner & C. M. Rader, pp. 223-227, New York: IEEE Press, 1972.

C. F. Gauss, "Nachlass: Theoria Interpolationis Methodo Nova Tractata," in Carl Friedrich Gauss Werke, Band 3, Königliche Gesellschaft der Wissenschaften: Göttingen, pp. 265-330, 1866.

Michael T. Heideman, Don H. Johnson, and C. Sidney Burrus, "Gauss and the History of the Fast Fourier Transform," 1985. Rice University Engineering Dept.

Orian Leitersdorf, Yahav Boneh, Gonen Gazit, Ronny Ronen, and Shahar Kvatinsky, "FourierPIM: High-Throughput In-Memory Fast Fourier Transform and Polynomial Multiplication," Viterbi Faculty of Electrical and Computer Engineering, Technion - Israel Institute of Technology, Haifa, 3200003, Israel. arXiv:2304.02336v1 [cs.AR], 5 Apr 2023.

Alan Perlis, "The Synthesis of Algorithmic Systems," the 1st ACM Turing Award Lecture, ACM, Vol. 14, No. 1, 1967.

Zhijing George Mou, "A Formal Model for Divide-and-Conquer and Its Parallel Realization," Research Report YALEU/DCS/RR-795, Yale University, May 1990.

Copyright © 2024 Thomas C. Veatch. All rights reserved.
Created: June 13, 2024; Modified June 17, 2024