Parallel Fourier Transform, defined and explained below, computes the DFT in O(log(N)) time in parallel, disregarding communication costs. For \(N=1024\), its time cost is O(10), as compared with the sequential (Gauss, Cooley-Tukey) "Fast" FT time cost of O(10,240) or the naive DFT time cost of O(1,048,576).
Evidently Divacon parallelization is powerful, if we can just learn its simple elements and practice its simple way of thinking.
\(!\ f\ X\) is the essence of parallel decomposition. The symbol \(!\) here says: apply \(f\) to each of the separate elements of \(X\). It doesn't need them to be done in any particular order, and in particular they can be done separately and all at once, that is to say by parallel processors, if you have them available to do the work, or not, if not. If you can take apart your algorithm so that it contains itself not just recursively but inside a distribution, as in \(f\ =\ ...\ !\ f\ ...\), then you have a parallel program.
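In ordinary code, distribution amounts to something like the following Python sketch (the helper name distribute and the process-pool details are my own illustration, not Divacon itself):

```python
# "! f X": apply f independently to every element of X.  Sequentially this is
# just [f(x) for x in X]; because no application depends on another, the same
# work can be handed out to parallel workers.
from concurrent.futures import ProcessPoolExecutor

def distribute(f, X):
    # f must be a module-level function so the process pool can pickle it.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(f, X))
```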
\( log : * = + : log\), saying that the log of a product is the sum of the logs. Right? For example, \(log(a*b*c)=log(a)+log(b)+log(c)\). Morphisms are helpful when it is easier to go do something in a different domain and then come back with the same result.
This is getting close to \(f\ A\ =\ c\ :\ !\ f\ :\ d\ A\), which would say that \(f\) on arrays is (can safely be defined as) a morphism of itself on divided sub-arrays that are then combined. That would be a mathematical proof of a parallel decomposition of a function.
Divacon defines a higher-order-function generator, PDC(), which accepts six functions, and returns a parallel program implementing the algorithm embodied in those functions. Define and apply a PDC as follows:
\(pdc\ =\ PDC(\)divide, combine, preadjust, postadjust, basepred, basefunc\()\)
\(Y\ =\ pdc\ X\)
A PDC so defined is equivalent to the following expression, which constitutes a general if not exhaustive theory of parallel algorithm decomposition.
Divacon:
\(pdc = (\)basepred ? \(then\) basefunc \(else\) combine : postadjust : \(\ !\ pdc\ \) : preadjust : divide \()\) Study this until it is intuitive.
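If it helps to see the same thing in ordinary code, here is a minimal sketch in Python of what a PDC() generator realizing the expression above could look like; the six parameter names follow the text, and the list comprehension stands in for true parallel distribution (that scheduling shortcut is an assumption of the sketch, not Mou's implementation).

```python
# A sketch of a PDC() higher-order-function generator.  The line marked
# "! pdc" is where a real system would fan the sub-problems out to separate
# processors; the functional shape is the point here, not the scheduling.
def PDC(divide, combine, preadjust, postadjust, basepred, basefunc):
    def pdc(x):
        if basepred(x):                      # base case: problem is trivial
            return basefunc(x)
        parts = divide(x)                    # structure manipulation only
        parts = preadjust(parts)             # communication + local op (way down)
        parts = [pdc(p) for p in parts]      # "! pdc": independent, parallelizable
        parts = postadjust(parts)            # communication + local op (way up)
        return combine(parts)                # structure manipulation only
    return pdc
```

With this sketch, `pdc = PDC(divide, combine, preadjust, postadjust, basepred, basefunc)` and `Y = pdc(X)` mirror the two Divacon lines earlier.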
The six functions are mostly trivial or one selection from a small set of canonical functions. Their combination is to be understood as follows.
The ":" composition operator operates right to left: \((f:g)(x) = f(g(x))\). So the first function in the above chain is the divide function.
Having divided a bunch of stuff typically in half, we then apply to each half the next function in the chain: preadjust. Preadjust is going to be a "communication function" which specifies, and then brings in, what other data is needed to do this computation.
The local computation which does the work will need all its relevant data to be local, so we must capture everything it needs and bring it in. In particular there will be a tuple containing everything the local operation requires. Further, there will be an array of such tuples, because a parallel process means doing the same thing identically in a bunch of parallel processors, and to do that we need to provide each processor the same kind of data, namely a tuple of all the required data elements. That means we need an array of those tuples. Then we can send one tuple to each processor, and each can do its work separately from all the others, which means it can be done in parallel, which is the point.
Whereas if you need to separately think about what you need in the midst of the working of each parallel process, and then copy over what it requires from wherever it is, well, now you're talking spaghetti code, and we don't tolerate such poor hygiene here.
Although this was also discussed here, the following may be helpful. divide and combine are structure-manipulation functions: they take an array, the input data, and restructure it. \(d_{eo}\) divides a single input array into two, taking evens to the left and odds to the right, while \(c_{eo}\) combines two input arrays into one, shuffling them into even and odd positions in the output. This has nothing to do with going elsewhere to get any data (communication), or doing anything to the data (local calculation): it is orthogonal. But once you've specified your divide and combine functions, you have a whole structure up and down the parallel computation graph which makes life simpler for the other tasks. Similarly, communication and local calculation are orthogonally specifiable, and better so. Communication resides in two forms: specifying what to get, and getting it.

But the more basic idea is to think about this in the context of parallel programming. In a parallel program, there is going to be some data which every processor will be separately processing, because to be parallel we want them working independently. This means we would like that data to be entirely local to each process, so that it can just do its local calculations without interrupting itself to go fetch data from some other place. This means all the data it needs must be provided in a data structure fed to it, separately. So we want an array of tuples, one for each subprocess, each containing all the data required by that instance of the subprocess. Then each one can proceed on its own, which is what makes it a parallel program.

Secondly, consider the interrelationships among the data of a parallel process. Is it not always and necessarily a bunch of data? If it's a bunch of data, can we not always go through it and make an array of it all? Of course, yes. Having made an array of it, and by assumption it's all the data our parallel process will be using, isn't there some potential dependency between the different elements of the data? You may need, for processing this part of the array, some data from some corresponding other part of the array. That's the idea here.

Let's put it all together: pdc = combine : postadjust : ! pdc : preadjust : divide. Reading backwards (that is, in compositional order from right to left), the first, divide function takes the input, which is an array of data, and divides it, typically in half (dividing into thirds or 1/N's is okay too). Now it is tuple-building time, so that we can run the process separately on its collected required data elements, stuffed into one tuple for each process. The preadjust function builds the tuple: first by specifying which corresponding element (or more) needs to be brought in here for the local computation, that is, by an index-generating function; second, by copying the indexed other data from the other sub-array into this element's tuple, by a communication function.
So the index-generating function takes the normalized index within our sub-array (the index of "this" tuple) and somehow calculates the index or indexes of the desired other data, in a somehow-corresponding place in the other. For example, the \(N\)th elements in both sub-arrays will both have index \(N\), so it's not too complicated to figure out from one what the other is. (That's called corr, for "correspondent" communication.) Another, mirr for "mirror", pairs up the \(N\)th in this sub-array with the (length\(-N\))th in the other, and vice versa. In both cases the "other" is known for any array's \(N\)th element without asking anywhere else for outside data.
Then the communication function builds our tuple. It takes the local array element as the first element in the tuple, and it fetches from the other array the element whose index was generated by the index-generating function. So both right and left sub-arrays get manipulated into new structures: they become arrays of tuples.
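Putting the last few paragraphs into code, here is a sketch in Python of the structure functions and of a communication step that turns two sub-arrays into two arrays of tuples. The names d_eo, c_eo, corr, and mirr come from the text; the (self, other) tuple layout is my assumption.

```python
# Structure manipulation: even-odd divide and its inverse combine.
def d_eo(X):
    """Divide one array into (evens, odds) by position."""
    return [X[0::2], X[1::2]]

def c_eo(halves):
    """Combine (left, right) by shuffling back into even/odd positions."""
    left, right = halves
    out = [None] * (len(left) + len(right))
    out[0::2], out[1::2] = left, right
    return out

# Index-generating functions: given a normalized index i within a sub-array of
# length n, name the index of the desired element in the *other* sub-array.
def corr(i, n):
    """Correspondent: pair with the same normalized index."""
    return i

def mirr(i, n):
    """Mirror: pair with the (n-1-i)th element of the other sub-array."""
    return n - 1 - i

# Communication: build the arrays of (self, other) tuples, so each later local
# operation has everything it needs locally.
def communicate(index_fn, halves):
    left, right = halves
    n = len(left)
    left_t  = [(left[i],  right[index_fn(i, n)]) for i in range(n)]
    right_t = [(right[i], left[index_fn(i, n)])  for i in range(n)]
    return [left_t, right_t]
```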
Does it seem as if we could wait? No! We want all the communication to be handled at once, and separately from the local calculations. We create nice tidy separate workspaces with everything needed in a nice tuple of operating supplies, so then the local operation task can be clean and simple.
At this point we have specified divide, combine, and communicate. Divide and combine are functions given separately to PDC(), while communicate might be just a compositional part of the preadjust function (the first-applied part), if it is followed by a local computation function to be carried out in the building up of the computation graph.
The postadjust function might also include communication as well as local operation.
See, parallel processing builds up a computation graph which is a tree. At the top the tree is a single node that includes the whole giant problem as one bite, but at the leaves of the tree the problem is usually too trivial to even do anything. You can often swallow without even chewing once!
You might think (I confuse myself by thinking) the tree does calculations at each node, thereby doing a recursive breakdown or buildup of results from smaller parts of the tree; but actually those calculations are not so much done at the node as during the building out of the tree, or during the summarizing or winding back up of it. The calculations (except at the very ends, the leaves of the tree) are always and only done on the way up or on the way down. Once again, there is no separate traversing of the tree to do operations at each node outside the preadjust or postadjust processes, which themselves build out the branches from the nodes, or collapse the branches into the nodes, respectively.
Finally, how do we go about understanding and defining our local operation? Let's take those separately.
Understand the local operation by recognizing the following points; they should help you.
- that it will be a function which takes a tuple (or you might say a list of arguments) as its input;
- that it cannot access anything outside itself and its inputs, so you could put constants in it for your math to work out, or if you need something else then you have to build that into your original or communication-expanded data before you even get to the local operation;
- that it will be distributively (separately, simultaneously, on separate processors) applied to all the elements in an array (of tuples), one process and one copy of the function operating on one tuple;
- that its output will be a single element, perhaps a 1-tuple, with the type and structure of an element in the original (top-level-input) data array on the way from root to leaf, or with the type and structure of the ultimate answer in the final (top-level-output) data;
- that all known local operation functions are goddamn simple. If it takes you more than a minute to write one, you can be sure you're wrong. As George says, things are either obvious or impossible. So keep thinking until you see the simple essence of how the function is built out of itself.
Next, defining your local operation. This will be the nub of the idea of the parallel decomposition. If you stare at your equation, you'll find a repeating pattern which allows the problem to be broken down into multiple smaller instances of the same problem. How those instances' results are combined to make the larger solution: that combination is the work of the local operation.
Actually, you really don't have to solve the problem, at all, just see that it contains a simpler or smaller version of itself.
In FFT it's: shift (multiply by a constant) the sum of the odds and add it to the sum of the evens. In Reverse it's: delete self, accept other (referring to the two elements of a communicated 2-tuple).
I forgot to mention, anything that is needed that is NOT part of the (variable-content) array can be specified as a constant in the local operation function itself. Everything else must come from the input array.
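As a quick illustration of how small these local operations are, here are sketches in Python of the two just mentioned (the names are mine). Note that the FFT one bakes its step factor in as a constant via a closure, per the preceding paragraph.

```python
# FFT nub: scale the odds' result by the constant step factor S and add it to
# the evens' result.  S is baked in when the function is made, so the local
# operation touches nothing but its input tuple.
def make_fft_local(S):
    def fft_local(t):
        self_val, other_val = t    # (self, other) from the communicated tuple
        return self_val + S * other_val
    return fft_local

# Reverse nub: "delete self, accept other".
def reverse_local(t):
    self_val, other_val = t
    return other_val
```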
basepred and basefunc are discussed here. So that's everything. Structure manipulation in the divide and combine functions; orthogonal to communication, which is the first-applying part of the preadjust or postadjust functions; orthogonal to local operation, which is the second-applying part of the preadjust or postadjust functions.
Examples
Parallel Fourier Transform

For \(b \in 0..(N/2)-1 \), let
\(DFT(b,N,X) = \large\sum_{s=0}^{N-1} X_s * e^{isTb/N}\)
Our nub is to separate even and odd terms, which can be calculated separately and added, and then to notice the odd sum can factor out our base step angle \(S_b = e^{iTb/N}\), leaving the identical evens' formula, or calculation process, but using the odds' data. Since the subproblems are half the size, the step angle doubles from \(S_b\) to \((S_b)^2\), to reach the same number of rotations after only half the steps.

Definitions:
- 1 Tau or \(T=2\pi\ \) radians = 1 bend = 360 degrees
- \(N\) is the number of samples in our data array.
- \(T/N\) divides the circle into \(N\) equal fractions.
- \(s \in 0..N-1\) counts the samples in our index-normalized data array.
- \(bT\) is \(b\) full circles or "bends".
- \(s/N\) is a fraction ranging from 0 to 1 as \(s\) ranges from 0 to N.
- \(sTb/N\) ranges from 0 to Tb (that is, b bends) as \(s\) ranges from 0 to N.
- \(Tb/N\) is our base step size.
- \(e^{i\theta}=cos(\theta)+i\ sin(\theta)\) means increasing \(\theta\) by one can also be done by multiplying by \(e^{i1}\) instead of working inside the cosine and sine functions' arguments. This allows factoring out a power of \(e^i\) from the odd summands (\(s\) odd) in the DFT above, which means the odds sub-function is just the evens sub-function times that power of \(e^{iTb/N}\).
\(F_{S_b,N}\ X = \large\sum_{s=0}^{N-1} X_s * (S_b)^s\)

\(= \large\sum_{s=0}^{\frac{N}{2}-1} X_{2s} * (S_b)^{2s} \ +\ \large\sum_{s=0}^{\frac{N}{2}-1} X_{2s+1} * (S_b)^{2s+1}\)

\(= \large\sum_{s=0}^{\frac{N}{2}-1} X_{2s} * (S_b)^{2s} \ +\ S_b * \large\sum_{s=0}^{\frac{N}{2}-1} X_{2s+1} * (S_b)^{2s}\)

\(= \large F_{(S_b)^2,\frac{N}{2}}\ X_{evens} \ +\ S_b * F_{(S_b)^2,\frac{N}{2}}\ X_{odds}\)

Since \(F\) appears both inside and outside, this can be transcribed as a PDC parallelization of \(F\).
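As a cross-check on the identity above, here is a direct sequential transcription of it in Python. It computes a single frequency bin \(F_{S_b,N}\ X\) by exactly the even/odd split shown; the function name is mine, and this is a plain recursion, not the Divacon program.

```python
import cmath

def F(S_b, X):
    """F_{S_b,N} X = sum_s X[s] * S_b**s, computed by the even/odd identity."""
    if len(X) == 1:
        return X[0]                        # a sum of one term
    evens = F(S_b * S_b, X[0::2])          # same formula, squared step
    odds  = F(S_b * S_b, X[1::2])          # same formula, the odds' data
    return evens + S_b * odds              # factor S_b pulled out of the odd sum

# e.g. bin b of an N-point transform, with S_b = e^{iTb/N} and T = 2*pi:
# N, b = 8, 3; S_b = cmath.exp(2j * cmath.pi * b / N); F(S_b, X)
```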
The following Divacon expression implements a Parallel Fourier Transform (PFT) algorithm. In Divacon code:
\(F_{b,N}\ X\ = \) atom ? \((C(id),\ C(id),\ b=0)\) else \(c_{eo} : !(\text{self}+S^b*\text{other},\ \text{self}-S^b*\text{other},\ 2b)\ :\ \#\ !\ corr\ :\ !\ F_{2b,\frac{N}{2}}\ :\ d_{eo}\ X \)
Breakdown and Explanation:
pft = PDC( \( \text{divide}=d_{eo},\ \text{combine}=c_{eo},\) \(\text{preadjust}=id,\ \text{postadjust}=!(\text{self}+S^b*\text{other},\ \text{self}-S^b*\text{other},\ 2b) : \#\ !\ \text{corr}, \) \(\text{basepred}=\text{atom}?,\ \text{basefunc}=(C(id),\ C(id),\ b=0)\ )\)
Meaning:
\(F_{b,N}\ X\ \ \ \ \) Fourier Transform of samples X[0..N-1] at a rate of \(b/N\) rotations per sample.
\((f\ :\ g)(x) = f(g(x))\ \ \ \ \) \(:\) means function composition, \(f\) upon \(g\): first run \(g\), then run \(f\) on the output of \(g\). So \(d_{eo}\) is the first operation above. \(c_{eo}\ :\ d_{eo}\ =\ d_{eo}^{-1} : d_{eo} = id\). Here \(c,d\) mean combine and divide; \(eo\) means by evens and odds (think: deal out, shuffle together).
\(id\ \ \ \ \) is the function whose output is identical to its input.

\(!\ f\ A\ \ \ \ \) \(!\) means function distribution. Apply \(f\) separately to each element of array \(A\) (i.e. in parallel). The result comes back as an array of all the results (usually a very short array of just two results).

\(self+S*other\ \ \ \ \) The local operation here is simply to multiply the odds result by a step factor \(S\) and add the evens result. \(S=S_b=e^{iTb/N}\)
The step factor is a multiplicative rotation (an angle applied by multiplying) whose angle doubles at each deeper level of the division tree, from \(S_b\) to \((S_b)^2\). Or, it halves at every level on the way up.
Tukey calls this the twiddle factor. Also \(T=2\pi\).
- Inputs \(b, N, X\): Let N be a power of two so we can divide X in half all the way until we get singleton arrays at the end. \(b\) bends or full rotations defines the angular frequency of interest, via an initial base step angle \(Tb/N\) expressed as \(S^b = e^{iTb/N}\). As we count across all our samples, we also cycle \(b\) times around our periodic reference signal \(e^{isTb/N}\).
- Divide the source data array using even odd division, \(d_{eo}\). The 0th element is even, the 1st is odd, and after \(d_{eo}\) division the latter becomes the 0th element of the odds, the right hand sub-array. The evens and the odds are two sub arrays of length \(N/2\), with normalized indices \(0\ ..(N/2)-1\).
- Recursion on \(F\): Although \(F\) seems to be doing local calculations at each node and branch of the tree, and indeed it is, all its work down to the leaves is done by its own internal re-instantiations of divide, combine, preadjust and postadjust at the different levels; there is nothing within \(F\) other than those. It's strangely empty in the middle. This is the trick: it's empty in the middle; everything is in the recursive or parallel breakdown. Can you think of it that way?
- Distribution: \(!\ F\) means apply F to each of the tuples supplied to it by the function to the right, which will be the preadjust function.
This is the essence of parallel decomposition. More than recursive decomposition where a function's definition includes a reference to itself, in parallel decomposition a function's definition includes a distribution of itself. \(!\ F\) says separately apply \(F\) to each of the tuples in the following array. Since they are done separately, they can be done in parallel. Hence the power of parallel, more than mere recursive, analysis.
\(!\) is the whole trick, and everything else revolves around enabling our formalism to use it: dividing arrays to be distributed over; implicitly normalizing their indexes so that after any number of divisions we can think of each sub-array as counting from 0 to length-1, and so that \(F\) can simply apply to the divided arrays just as to the original; and communicating data from other parts of the array so that the local computation can actually be local.

The design problem in creating a parallel algorithm with Divacon is to figure out the minimal nub of local computation, and to place it properly in the expanding and collapsing of the computation tree; then to move data around so it's ready for that nub of local computation to do its work at that point. By then you should know whether the divisions should be even/odd or first-half/second-half (LR). It is a 3D dance of perspectives on your problem, and until you get the idea of this approach it seems impossible; but once you get it, any algorithm's coding is almost trivial.

George said he wrote his dissertation's benchmarked reference algorithms in less than a half hour each, and each attained optimal or "sub-optimal" speeds. I think George's perfect English finally failed him: by "sub-optimal" I think he meant less than the so-far-best times, i.e., faster than the best results so far, not slower, or at least he meant "near-optimal". I hope readers didn't read that and misunderstand.
- No preadjustment: If \(preadjust = id\), it is a no-op; the output of \(id\) is just its input; no change.
This means the computation graph gets built all the way down to the leaves of the tree by first dividing, as part of \(pft\), then dividing again as part of recursive calls to \(pft\), and no other kinds of operations on the way down, because preadjust is almost a no-op. The work, the actual calculations, will be done slightly at the bottom, and almost entirely on the way back up the tree.
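Here is one way to wire the whole PFT through the PDC() sketch given earlier, again in Python. To keep the example easy to verify I make two substitutions of my own: a concatenating combine (c_lr in Divacon terms) so the frequency bins come out in natural order rather than the shuffled order \(c_{eo}\) produces, and each level's base step computed from the sub-array length instead of threading the \(2b\) parameter. \(T=2\pi\) and the positive exponent follow the text.

```python
import cmath

# A sketch of the PFT wired through the PDC() generator sketched earlier.
def make_pft():
    def divide(X):                          # d_eo
        return [X[0::2], X[1::2]]
    def combine(parts):                     # concatenate (c_lr-style); see lead-in
        return parts[0] + parts[1]
    def postadjust(parts):                  # butterfly: self +/- S^k * other
        E, O = parts
        n = 2 * len(E)
        S = cmath.exp(2j * cmath.pi / n)    # base step e^{iT/n} at this level
        left  = [E[k] + S**k * O[k] for k in range(len(E))]
        right = [E[k] - S**k * O[k] for k in range(len(E))]
        return [left, right]
    return PDC(divide, combine,
               preadjust=lambda parts: parts,    # id: nothing on the way down
               postadjust=postadjust,
               basepred=lambda X: len(X) == 1,   # atom?
               basefunc=lambda X: list(X))       # a singleton is its own transform

# pft = make_pft(); bins = pft([complex(x) for x in samples])
```

Every list comprehension above is a distribution: its elements are independent, so on a parallel machine each level of the tree costs O(1) and the whole transform costs O(log N), which is the claim of the next section.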
\(PDC\ FT\) in O(log(N))
Perlis has a hint in his Turing Award talk (ACM V 14 No 1, 1967), pp. 5-6: "I have mentioned that procedure identifiers are initialized through declaration. Then the attachment of procedure to identifier can be changed by assignment. I have already mentioned how this can be done by means of [function] pointers. There are, of course, other ways. The simplest is not to change the identifier at all, but rather to have a selection index that picks a procedure out of a set. The initialization now defines an array of forms, e.g., procedure array P [1:j] (f1, f2, ..., fj); ... The call P[i] (a1, a2, ..., aj) would select the ith procedure body for execution. Or one could define a procedure switch P := A, B, C and procedure designational expressions so that the above call would select the ith procedure designational expression for execution. The above approaches are too static for some applications and they lack an important property of assignment: the ability to determine when an assigned form is no longer accessible, so that its storage may be otherwise used. A possible application for such procedures, i.e., ones that are dynamically assigned, is as generators. Suppose we have a procedure computing (a) \(\large\sum_{k=0}^{N} C_k(N) x^k\) as an approximation to some function (b) \(f(x)= \large\sum_{k=0}^{\infty} C_k x^k\), when the integer N is specified. Now once having found the \(C_k(N)\), we are merely interested in evaluating (a) for different values of \(x\). We might then wish to define a procedure which prepares (a) from (b). This procedure, on its initial execution, assigns, either to itself or to some other identifier, the procedure which computes (a). Subsequent calls on that identifier will only yield this created computation." (pp. 5-6)

Perlis' suggestion seems very close to the incremented-base-step adjustment of the functions \(F_b\) of a multi-frequency optimized parallel Fourier Transform. The parallel breakdown of the function creation seems obvious: a log(N) decomposition of the calculation of base steps.
Let's try generating all base steps \(S_b = e^{iTb/N}\) in \(log(N/2)\) steps:
\(S_1 = e^{iT/N}\)

\(S[] = \) doubleshift \(Ss\)    // calculate the base steps Ss in log N time

where doubleshift \(Ss\ = \)
    atom\(?\ \ \ (\ 1\ \ S\ )\)    // base case
    else \(\ \ (\ Ss\ \ \ !(\ S_1\ *\ ((last\ 0)\ Ss))\ \ Ss\ )\)    // retain all Ss on the left half, increment copies on the right
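Here is the same doubling idea as a Python sketch (the exact Divacon notation aside): build the whole table of base steps \(S_b=e^{iTb/N}\) by keeping the existing entries on the left and appending shifted copies on the right, so the table doubles each round and is finished in \(log_2(N/2)\) rounds. Within a round every multiplication is independent, so with enough processors each round is O(1).

```python
import cmath

def doubleshift(N):
    """Table of base steps S_b = e^{iTb/N} for b = 0 .. N//2 - 1 (N a power of two, N >= 4)."""
    S1 = cmath.exp(2j * cmath.pi / N)       # S_1 = e^{iT/N}
    Ss = [1 + 0j, S1]                       # base case: (1, S)
    while len(Ss) < N // 2:
        shift = S1 * Ss[-1]                 # S_m, where m = len(Ss)
        Ss = Ss + [shift * s for s in Ss]   # keep left half, append shifted copies
    return Ss
```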
\(PDCFT[b] X = PDC(d_{eo},c_{eo},id,ma(Ss[b]),atom?,id) X\ \ \ \ \ \ \ \) // calculate \(F_{S_b}\) in \(log(N)\) time, in parallel for each \(b\).
Conclusion
The previous section is nice, but its whole computation of \(S_b\) values is actually independent of the data, so it could be precalculated once for a given N and stored for multiple later uses. So you could choose to accept my claim \(PFT\ \sim\ O(log(N))\) because the \(S_b\)'s are precalculated, or you could choose to believe it because in the curious math of computational complexity, \(O(2*log(N)) = O(log(N))\), as they say, "within a constant factor".

In either case I was shocked to discover I had reduced a 1024-point DFT (to take a typical audio analysis frame size) from the well-known FFT time of \(O(N * log(N)) = O(1024*10)\) to this amazing PFT time of \(O(log(N)) = O(10)\). A 1000X speedup is shocking. In fact I didn't believe it, before I forced myself to imagine 1000 processors working on it together. Edible piranhas indeed. Typical of George, he said laconically of this Divacon expression, "Looks right to me", and of the speed result, "Of course, order log N in parallel". He knew it already and was not the least surprised, yet my Google searches only find NVidia itself, and some Israeli computer scientists in 2023, saying that this is even an imaginable goal. So I'm confused whether I deserve a Satisfactory for my homework, or a Turing Award for out-Gaussing Gauss himself. Does anyone else find this amusing?
Later I discovered a reference to "algorithms using n processors and running in O(log n) time" for the FFT, p. 233 in Selim Akl's 1989 book, The Design and Analysis of Parallel Algorithms.

A few more speedups might be worth mentioning. If we do calculate the \(S_b\)'s, then the application of PFT[b] may be started as soon as its corresponding \(S_b\) is calculated, so that only the last of the former need be delayed until the final result of the latter; staggered early starts might save some parallel-computation nodes' time before the last come in. Also, if you happen to only care about getting the initial (lower-frequency) results from the analysis, you could stop at any specified discard threshold and save a fraction of the time.

To reiterate, if fully parallelized, then the N functions \(PFT_b\) are, first, themselves generated in log(N) time if not precalculated entirely, and second, applied to the sampled data vector X in log(N) time. Hence the entire vector of frequencies F is calculated in O(2*log(N)) = O(log(N)) time. There are a lot of multiply-adds going on, but they are done in parallel in a small number, log(N), of levels, and each level takes time O(1).
From what little I know, I believe this Parallel Divacon Fourier Transform algorithm is the fastest Fast Fourier Transform algorithm. My academic friends gently correct me, saying no, FFT is O(N*log(N)); but yes, that is incorrect where parallel. I found a 2023 ArXiv reference that also claims O(log(N)) speed for an FFT, but that is a pretty complex paper, and was my 12-symbol PFT homework really that fancy? No. That reference in turn refers to NVidia's efforts to beat O(N*log(N)) with its GPUs. Apparently O(log(N)) is actually newsworthy. I'll take it.
On the other hand, this analysis ignores any cost for plenty of processors, and assumes zero time cost for communications, which actually becomes a dominant factor in highly parallel computing.
Bibliography
John Backus, 1977 Turing Lecture.
H. Burkhardt, "Trigonometrische Interpolation," Chapter 9 in Encyklopädie der Mathematischen Wissenschaften, vol. 2, part 1, 1st half, 1899-1916.
H. Burkhardt, "Trigonometrische Reihen und Integrale (bis etwa 1850)," Chapter 12 in Encyklopädie der Mathematischen Wissenschaften, vol. 2, part 1, 2nd half, 1904-1916.
J. W. Cooley & J. W. Tukey, "An Algorithm for the Machine Calculation of Complex Fourier Series," Math. Comput., vol. 19, no. 2, pp. 297-301, Apr. 1965. Reprinted in Digital Signal Processing, ed. L. R. Rabiner & C. M. Rader, pp. 223-227, New York: IEEE Press, 1972.
C. F. Gauss, "Nachlass: Theoria Interpolationis Methodo Nova Tractata," in Carl Friedrich Gauss Werke, Band 3, Königlichen Gesellschaft der Wissenschaften: Göttingen, pp. 265-330, 1866.
Michael T. Heideman, Don H. Johnson, C. Sidney Burrus, "Gauss and the History of the Fast Fourier Transform," 1985. Rice U. Eng. Dept.
Orian Leitersdorf, Yahav Boneh, Gonen Gazit, Ronny Ronen, Shahar Kvatinsky, "FourierPIM: High-Throughput In-Memory Fast Fourier Transform and Polynomial Multiplication," Viterbi Faculty of Electrical and Computer Engineering, Technion – Israel Institute of Technology, Haifa, 3200003, Israel. arXiv:2304.02336v1 [cs.AR], 5 Apr 2023.
Alan Perlis, "The Synthesis of Algorithmic Systems," The 1st Turing Lecture, ACM, vol. 14, no. 1, 1967.
Zhijing George Mou, "A Formal Model for Divide-and-Conquer and Its Parallel Realization," May 1990, Research Report YALEU/DCS/RR-795.