Divacon Tutorial

How to think about parallel algorithms

On Mou
on Divide and Conquer

TL;DR

Divacon Parallel Algorithm Design:

Your problem is to write a function that does something with an array. (Is it even possible for parallel computation to be anything else?)

To solve any computing problem, you will necessarily make use of three orthogonal constituents, none of which includes anything of the others.

To understand your problem is to break your task into these constituents. To build the best parallel algorithm, hang those parts where they belong on the general, recursive parallel computation scaffold, which is the Divacon equation:

$$\begin{align}f\ A\ =\ base?\ & f_b\ \\ else\ \ \ \ & c\ :\ post\ :\ !f\ :\ pre\ :\ d\ \ A. \end{align}$$

Where:
  • \(f\) is your function, \(A\) is the data array it operates upon.

  • ":" means function composition. \(f : g\ x = f(g(x))\); we don't like unnecessary parentheses.

  • \(f\) contains \(f\), meaning this is a recursive definition.

  • \(!f\) means apply \(f\) separately to any number of inputs, that is, if you want, in parallel.

  • \(f =\ ..!f..\) produces a tree structured computation graph: each (parent) instance of \(f\) has some number of daughter instances \(!f\).

    Here \(f\) is more an organizing principle of a control pattern than what you might normally think of as a meaningful self-contained function. All the work in \(f\) is done by the parts of \(f\), always outside the included \(f\)'s, in what we call \(c, d, pre, post\) parts -- except for the tiniest trivial bit (\(f_b\)) at the base of the division tree. The point of \(f\) is merely that it is recursive and parallel, and otherwise it does nothing but orchestrate the sequence, the subdivision, the mutual dependencies of its parts.

  • \(c\) combines once for each time \(d\) divides. After \(d\) and before \(c\) we have data parallelism: each subdivision of the data can be operated on separately.

  • \(f_b\), the base function, converts a data value from input type to output type once division is down to a single data value.

  • \(pre = preop : precomm\ \ \ \) and \(\ \ \ post = postop : postcomm\ \ \ \ \ \) Here, communication feeds local operations in both cases.

  • \(preop, postop\) prepare for or recover from the recursive calls \(!f\). Connecting \(d\) to \(f\), or \(f\) to \(c\), these enable separated sub-arrays to be separately fed to instances of \(f\), or enable sets of \(f\) outputs to be combined so as to have the form and meaning of the function's output. These do local calculations only, isolated from any communication or division or combination. Each converts data of the input or output type from one level to another but retains the same general data type on the input or output sides, respectively.

  • \(precomm, postcomm\) do communication only, to bring in any additional data needed so that local calculations can be local.

Working right to left in the formula: take the array and recursively apply division until you can't divide any more; after that, apply combination until you can't combine any more. On this general scaffold, you need three things to happen. At the limit of division, you need a function to convert from input data type to output data type (it might be just a conceptual change of units, not always a numerical change of values). On the way up or down or both, you may need a local calculation at each level, to prepare the inputs for separate application of \(f\), or to postprocess the outcomes so their combination will be a valid function output. Those local calculations will depend on local values, certainly, but perhaps also on values that need to be communicated from elsewhere in this or another array. Those dependencies determine what data is needed locally to do the calculation, and what needs to be communicated from elsewhere.
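
To make the scaffold concrete, here is a minimal sketch of it in Python (my own illustrative rendering, not George's executable notation; all the names are my choices). The comments mark where each of the three constituents hangs on the scaffold.

    def divacon(base_p, f_b, d, pre, post, c):
        """A sketch of: f = base? f_b, else c : post : !f : pre : d."""
        def f(A):
            if base_p(A):                     # limit of division reached
                return f_b(A)                 # convert input type to output type
            parts = pre(d(A))                 # d: structure manipulation; pre = preop : precomm
            results = [f(p) for p in parts]   # !f: the parts are independent, hence parallelizable
            return c(post(results))           # post = postop : postcomm; c: structure manipulation
        return f

Replace the list comprehension with a thread-pool or process-pool map and the \(!f\) step becomes genuinely parallel; nothing else in the scaffold needs to change.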

Once you get clarity on the Local-Co-SM breakdown of your problem (local operations, communication, and structure manipulation), it is easy to express it in Divacon, and you now have a language and methodology for parallel algorithm design. You will find it enables optimal and concise results.

Okay that was your quick review. If you have a spark burning, now, read on.

Introduction

These ideas, approach, formalization, notation, and system, come to me from the great George Mou, whom I am lucky and proud to know as a friend.

George's work in the math of algorithm parallelization using his elegant and powerful Divide and Conquer formalization, Divacon, is both revolutionary and generally unknown. For example, today's version of the Wikipedia entry for Divide-And-Conquer parallelization fails to even mention George's work. The Divacon approach enables miraculously clear thinking.

Let's not be like that. Properly educated, we could build unbelievably simple, provably correct, highly optimized parallelizations of important as well as new algorithms. So let's get educated!

Forgive me for a bit of a rant, because this is the exact kind of BS that has happened repeatedly in the past in the very subject I'm going to work on as my homework with you here, the FFT. Did you know that it was actually Gauss who invented the Fourier transform before Fourier, Gauss who further invented the discrete FT, and Gauss who further invented the fast FT just like we use today, all at once in 1805, while Fourier was doing only the first bit in 1807? But no one paid attention to it until Cooley and Tukey reinvented it in 1965. Even after that it was further discovered that not just one or two or several but many other reinventions of the same ideas had been forgotten -- including a review of all the forgotten reinventions, saying how terrible it was that they were forgotten -- which was itself forgotten for 60 more years! Here's a review of that reviewer:

"A second reference to Gauss' algorithm ... was written by H. Burkhardt in 1904. Burkhardt comments that Gauss' method was general, but seemingly not known by practitioners. It is interesting to note that Goldstine's and Burkhardt's works went almost as unnoticed as Gauss' work itself." (Heidemann, Johnson, & Burrus, 1985)

It may be without irony that George tells me he thinks the world may be ready for this in 200 years. No, it would be typical. (Am I sensitive?)

When I see this it kills me. It is on us to capture and use these achievements, to go through these doorways of genius. If we forget this like Gauss's notebooks or Danielson and Lanczos (1942) or the many others, then it is us, we the dummies, who should regret it, and after folks rediscover it later, we will be ashamed. Our age will be thought of with condescension, like those Aztecs or Incas who had wheeled toys but never built wheeled vehicles: so close and yet so far. Worse in our case, because our new powerful ideas have actually been built and used, and we still ignored them. So let's actually not!

Let's push it forward, and bring it into our understanding and our practice.

In service of this, my small contribution here: motivation, explanation, and a tutorial on how to think about parallel problems, with a little more verbose redundancy to make it easier to read than George's terse mathematics. I hope this brings George the credit he deserves, sooner than 200 years hence.

Inspiration

George Mou's 1990 Yale Computer Science PhD thesis was written in the Golden Age of parallel programming which was going on at Yale in the late 1980's and early 1990's.

In that golden age Alan Perlis was alive, saying things like

Also remember how Backus admonished us about the limits of von Neumann architecture (1-word-at-a-time communication, implied especially by the assignment operator): "Conventional programming languages .. Inherent defects .. inability to effectively use powerful combining forms for building new programs from existing ones, and their lack of useful mathematical properties for reasoning about programs. .. [versus] an algebra of programs whose variables range over programs and whose operations are combining forms. ... [and for example]

A Functional Program for Inner Product

Def Innerproduct = (Insert +) o (ApplyToAll *) o Transpose

[where] Composition (o), Insert, and ApplyToAll (α) are functional forms that combine existing functions to form new ones..." (ACM 1977 Turing Lecture)

And we can't forget the support of a highly distinguished dissertation adviser and committee, Paul Hudak, David Keyes, and Bill Gropp. (Just google those names to track the progress of parallel processing over 3 decades.)

Drum roll, please...

Alan Perlis said to George Mou, "Divacon is the golden key to the future of computer science".

Evidently Divacon parallelization is powerful. Its elements are simple, and its way of thinking is simple. Let's empower ourselves by learning and practicing it.

A Language For Parallelism

This conceptual approach, language, notation are from George Mou's 1990 Yale Computer Science PhD thesis.

We will try to get grounded in George's "DC" or Divide-and-Conquer or "Divacon" ideas and formalism, which give a mathematical form and elegance to parallel decomposition of algorithms.

Incidentally, Divacon notation has been made executable several times, not just on the Connection Machine of the 1980's and 90's, so let us recognize it is a grounded reality, not just a powerful tool of hyperabstract thought. After this DC language training, I'll apply this approach to the Fourier transform, as an exercise.

Before we begin, let me give us all an attitude adjustment. What we are embarking on here is the understanding of how to break down and solve big problems; to process an array of \(10^{10}\) elements, say, would be an example. So all the definitions and concepts and operations we are going to go over, you should think of them as being not just simple and definite and understandable, but also as giving you power and leverage and understanding at scale, for problems at any scale. So yes, please, first accept these ideas simply into your mind (they are stated simply enough), but also recognize and feel their generality and how they might apply to big things as well as small things, and if you do, then you may feel the power that is thrumming under your fingertips as you bring these definitions, forms, and components into your mental models and the working spaces of your mind. Their simplicity means that you can put a big harness over a big problem and control it fully by simple reasoning. Okay? Okay.

To begin, we will be needing some math-like notation and terms. The conventions below are taken from George's thesis.

Chief among these conventions is relative, or normalized, indexing: when a vector is divided, each subvector is re-indexed as if it were a whole vector in its own right, starting over from the beginning. So a (sub)vector might begin its life unnormalized, then become normalized, then be operated upon while normalized, which makes the algorithm simple, then be inverse-normalized back to its original indexing. Separating the calculations from the indexing struggles makes things nice and simple.

George likes to distinguish between communications versus operations. Getting the data you need, and doing something with it, should be separate concerns. Indeed, each has nothing intrinsically to do with the other. So we do relative indexing so as to forget about where the data came from and what indexes it may have had in that location, so that our operations can be simple and pure and local.

Parallel Forms

Next, some combining forms: composition ":", distribution "!", and communication "#", each described below. And some (standard abstract) math won't hurt us.

Finally, the simplest of all things, \(id\), the identity function, which returns its input unchanged. Was it simple enough? Maybe you're stuck on morphisms, but with ! and :, can you see how \(f\ =\ c\ :\ !\ f\ :\ d\) would be a parallel decomposition? That's all you really need.
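
If it helps, here are tiny Python stand-ins for those two combining forms (a sketch of my own, not Mou's notation made executable), applied for a single level of \(f\ =\ c\ :\ !\ f\ :\ d\):

    def compose(*fs):            # ":"  compose(f, g)(x) == f(g(x)), rightmost applied first
        def h(x):
            for fn in reversed(fs):
                x = fn(x)
            return x
        return h

    def distribute(f):           # "!"  apply f independently (hence parallelizably) to each part
        return lambda parts: [f(p) for p in parts]

    d = lambda A: [A[:len(A) // 2], A[len(A) // 2:]]   # a balanced divide
    c = lambda parts: parts[0] + parts[1]              # the matching combine (concatenation)

    one_level = compose(c, distribute(sorted), d)      # sorted stands in for the recursive call
    one_level([4, 2, 3, 1])                            # -> [2, 4, 1, 3]

Each half was handled entirely independently; that independence is the parallelism, and the recursion (plus the adjust functions described below) is what turns one level of it into a whole algorithm.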

Decomposition: How to Think Clearly About Computing

George's favorite lesson: Don't mix up how you will do something with what you are going to do, Tom!

But his primary teaching point in PDC is this, that computer algorithms are built from three orthogonal components: operation or calculation, communication, and structure manipulation. These are separate; they should be kept separate; and since they are orthogonal they can be specified separately. Further, mixing them produces confusion. Further, if you merely specify these three clearly, you've already done the job of parallel programming. More than just describing some qualities of the program, you will have already written it. Let me say it a few times.

Understanding computing deeply, which means simply, means recognizing the orthogonality of communication, structure manipulation, and local operations. When you communicate data, you aren't manipulating structure, and you're not doing local calculations. When you create and modify structures, you aren't communicating, and you aren't calculating. And when you are doing local calculations, you aren't communicating or modifying structures. If you can be clear about this, things fall into place very easily. Sadly, if not, not -- not in the world of parallel algorithm design.

On the other hand, the mere decomposition actually specifies, actually writes, a Divacon algorithm and a Divacon program. So let's try to be extremely clear about what these orthogonal dimensions are, and let's try to develop a fastidious intuition when we are specifying one of them about not mixing it up with either of the others. If we can do that, it will turn out that we can eat elephants for breakfast.

George seems to say, if you will just get your mind clear on what is being done in these three orthogonal dimensions, your work is basically done for you. Let's go through them twice more, and see if we can be clear minded about them then.

It's a lot. But I've been through it a few times and it is starting to make sense, so I encourage you.

Now, once the three orthogonal dimensions of computing are clarified, writing a Divacon program basically reduces to checking 6 boxes on a checklist. All you have to do is: define the divide and combine functions, which manipulate (divide and combine) structures; the communication function, which specifies what other data, from where, will be needed to do a computation; and the local computation, which can finally be carried out once the data is present. These are actually easy to define separately and to use systematically in the Divacon structure, which makes the work so easy and quick that George tells me:

Tom, if it takes longer than one minute, then your answer is wrong.
\(\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \)George Mou

Here's the checklist. Divacon defines a higher-order-function generator, PDC(), which accepts six functions, and returns a parallel program implementing the algorithm embodied in those functions. Define and apply a PDC as follows:

\(pdc\ =\ PDC(\)divide, combine, preadjust, postadjust, basepred, basefunc\()\)
\(Y\ =\ pdc\ X\)

A PDC so defined is equivalent to the following expression, which constitutes a general if not exhaustive theory of parallel algorithm decomposition.

Divacon:
\(pdc = (\)basepred ? \(then\) basefunc
\(else\) combine : postadjust : \(\ !\ pdc\ \) : preadjust : divide )
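
Here is a hedged sketch of that generator in Python (mine, for intuition only, with the \(!\ pdc\) step written as a plain list comprehension), followed by the six boxes checked for a toy problem, incrementing every element of an array:

    def PDC(divide, combine, preadjust, postadjust, basepred, basefunc):
        def pdc(X):
            if basepred(X):
                return basefunc(X)
            return combine(postadjust([pdc(part) for part in preadjust(divide(X))]))
        return pdc

    ident = lambda x: x
    pdc = PDC(
        divide=lambda A: [A[:len(A) // 2], A[len(A) // 2:]],  # balanced left/right split
        combine=lambda parts: parts[0] + parts[1],            # concatenate the halves back
        preadjust=ident,                                      # no communication or local prep needed
        postadjust=ident,
        basepred=lambda A: len(A) == 1,                       # atom?
        basefunc=lambda A: [A[0] + 1],                        # the only real work, done at the leaves
    )
    Y = pdc([3, 1, 4, 1, 5, 9, 2, 6])   # -> [4, 2, 5, 2, 6, 10, 3, 7]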

I find myself repeatedly studying this, and it is getting more intuitive.

The adjust functions have in general three phases: 1) communication specification, followed by 2) communication, followed by 3) local operation. They preserve the data domain they are operating within; so, for example, in an FFT the preadjust and divide side is all rearranging sampled real numbers representing the time-domain signal, while the postadjust and combine side is all operating on frequencies, complex numbers representing amplitude and phase for each different frequency.

Also, look suspiciously at an adjust function which does not emit the same kind of data structure it consumes; it is not supposed to be a structure manipulation function!

To create a Divacon algorithm the system creator specifies 6 things (most of which are typically either predefined, since generally useful, or quick work once you think in these terms and with this kind of vocabulary). The creator then lists the six as arguments to the PDC higher-order function (a function that creates a function), and runs the returned function on some data X to yield some output Y, as shown above.

It may help to draw out a computation graph such as on p79, and track what happens where so you can see what everything means. You should see how recursive definition works, how divide and combine operate on the tree, where basepred applies and what happens then, and finally how the adjustment functions work and where they apply exactly. Then if you can figure out communication, you'll be ahead of me.

Now, let's review the 6 with reference to the 3 orthogonal dimensions of computation.

But first, basepred. This is used to detect the leaves of the computation graph, at which divide has done its job and no longer applies. basepred does none of the three orthogonal tasks of computing: neither creating new values, nor figuring out what to fetch, nor modifying structure. It merely triggers basefunc; and then it is basefunc that does a local operation.

Structure modification is carried out within divide and combine. Only balanced divide and combine operations yield the benefits of parallel programming; \(d_{ht}\) (head-tail division) and the like will be as slow as any sequential program, stepping one by one through the data array or list.

In contrast, effective parallel decomposition gains its power by splitting giant things into smaller and repeatedly smaller things, which can all be handled at the same time in parallel. Then you might still have a million multiply-adds, for example, but they might take a single time step, given a million processors. Instead of eating an elephant one bite at a time (sequentially), a million little, themselves-edible piranhas each simultaneously take their separate bites and eat the whole thing all at once, except that then their bigger and bigger cannibal brothers eat two apiece, then two of those that ate two, then ... finally the whole digested elephant is inside two giant beta piranhas, which are themselves eaten in a single monstrous bite by the alpha piranha of them all. This is the miracle of parallel processing. (I know, elephants don't live in the Amazon! It's a metaphor, silly.)

So to get back to divide and combine: all those should do is take an array and split it into N sub-arrays, or vice versa; they do no local operations; and unlike algorithms such as quicksort, they don't care about the contents of the arrays, which makes it much easier, again, to reason about what they do.

If it's nice and clean like that, then you can indeed say and be satisfied that \(c\ d\ =\ id\). Then you can divide up, and eat up, the whole elephant, in nice tiny amenable tasty cubes at the lowest level, but in fully confident and no more difficult or complicated, enormous bites at the top. Because it's the same divide and combine at every level, and they always match perfectly with the other. Because \(c\ d\ =\ id\).

A combine task, for example, might include the denormalizing of the second of two vectors upon combining them into one. In this case, it adds subvector length N to all the indexes of the vector that is being denormalized, in order to sequentially combine it with another. Thus vectors indexed with [1 2]- and [1 2]- could be combined to a [1 2 3 4]-indexed vector.

(Isn't that a case of creating new values, which is a local operation? Or do we make an exception for the mechanics of this "relative indexing" system, and just hide those index adjustments?)
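Either way, the bookkeeping is small. Here is a sketch of it (my own, 0-indexed as Python is, so the offsets differ trivially from the 1-indexed example above): a combine that denormalizes the right subvector's indexes by the left subvector's length, exactly undoing the divide.

    def d_lr(A):
        """Divide: split in half; each half is renumbered ('normalized') from the start."""
        mid = len(A) // 2
        return [A[:mid], A[mid:]]        # Python slices renumber from 0 automatically

    def c_lr(parts):
        """Combine: denormalize the right half by offsetting its indexes, then merge."""
        left, right = parts
        combined = dict(enumerate(left))             # left keeps its indexes 0 .. len(left)-1
        for i, v in enumerate(right):
            combined[len(left) + i] = v              # index i becomes index len(left) + i
        return [combined[i] for i in range(len(combined))]

    A = [10, 20, 30, 40]
    assert c_lr(d_lr(A)) == A            # c d = id, at every level of the tree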

Local operations can be carried out within pre-adjust, post-adjust, and basefunc.

I once thought I wanted to put a multiply-add inside the combine operation: Mistake! That confuses matters; so instead the better approach is to put it into a postadjust function, or a preadjust in the next layer up. basefunc is often simply id, and all the actual work is done on the way up or down.

Communication is specified by corr, mirr, br, etc., index generators, or as I think of them, "finders", \(f\), which are then used to perform communication in the parallel form \(\#\ !\ f\), which may be part of the adjustment or base functions.

In parallel divide and conquer we need a base predicate to tell us True when we are finally done dividing the problem into sub-sub-problems and are operating at the base level of the division tree, the leaf level, and a base function to do the operation at that base level.

"[T]he simplest and most intuitive (base predicate and base function) are atom? and id respectively."(4.2.3, p77)

Then the simplest leaf operation would be: if (atom?) then return id;

This is what is specified by PDC(... atom?, id).

(As a bit of a miraculous example in the case of the FFT, basefunc=id because the 0th angle step is \(\theta=0\) and \(X * e^{i0} = X * 1 = X\).)

On Communication

# almost means Communication, a concept I struggled with. The following notes are copied or summarized from section 3.6.3 p56ff.

Let \(f\) be a finder or communication generator mapping index sets to index sets (not always one to one; for a single index it might want to bring in more than one).

Finder \(f\) will know enough about where it is to figure out where other needed stuff also is located, typically the low bits of the current index, which it can use to refer over to a corresponding element, for example, in a different section of the same array, or a corresponding section of an adjacent array but which shares the same low bits in its index. In that case it doesn't even have to ask anyone where to look for something, it already knows because it knows where it is, itself.

So given an indexable array \(A\) we have \(A : \vec{i}\) which is the indexed subset of \(A\) and we also can have \(A : f\ \vec{i}\) which is the subset of A that the mapped-to indexes pull out, when the index set \(\vec{i}\) gets mapped to some other index set \(f\ \vec{i}\). So that's another subset of A.

Then we can define \(\#\ f\) as:

\(\#\ f\ A\ =\ A'\), where \( A'\ \vec{i}\ =\ ( A\ \vec{i},\ A\ :\ f\ \vec{i}) \)
that is, get an extra copy of the \(\vec{i}\)-indexed values of A, namely \(A\ \vec{i}\), and bring them into a new tuple as its first element; but then also bring in the \(f\)-modified-index values from A, and put them together into a 2- or N-tuple. Now you have the other data that you may need all in one place for the operation you'll be doing later. In the context of such pairs, it is convenient to define:

\(self\ (x,y) = x\)
\(other\ (x,y) = y\)
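
In Python terms (a rough sketch of my own, with a plain list standing in for an indexed array and a one-argument index generator), \(\#\) lifts an index-to-index function into an array-to-array function whose entries are (self, other) pairs:

    def comm(f):
        """'# f': each entry i of the result is the pair (A[i], A[f(i)])."""
        def communicate(A):
            return [(A[i], A[f(i)]) for i in range(len(A))]
        return communicate

    self_ = lambda pair: pair[0]         # 'self' in the text; renamed so it doesn't read like a method parameter
    other = lambda pair: pair[1]

    mirr8 = lambda i: 7 - i              # mirror generator for an array of size 8 ($ = N-1 = 7)
    pairs = comm(mirr8)([10, 11, 12, 13, 14, 15, 16, 17])
    # pairs[0] == (10, 17), pairs[1] == (11, 16), ..., and other(pairs[0]) == 17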

George says, "The operator # thus is a higher order function that transforms a function over index sets to a function over arrays that performs communication. " (This becomes more clear when you recognize that corr operates (typically) on a pair of data arrays, and specifies for each element that the corresponding element in the other array is just the one in the same position.)

George continues: "A communication function maps each array entry to a pair, where the first component is the entry's original value, and the second is the value of another entry whose index is determined by applying the generator to the entry's own index." (So corr, mirr etc. operate on pairs of arrays by working on each of their entries, positionally, and copying the other array's position-matching element to come in here with this element in a tuple, which provides what is needed for local computation. Note that this is not considered to be structure manipulation and certainly not local computation: it is communication.)

You'll see \(\#\ !\ f\) a lot, which also uses the distribution symbol \(!\) to say, apply \(f\) separately to each of the elements of the following array.

There are several standard communication generators; you can read the thesis (section 3.6.3) for several more. \(f\) is commonly:

\(corr\ \vec{i} = \vec{i}\) correspondent communication (the index of the corresponding element in an adjacent array will be the same, or have the same low bits if the two arrays are merely conceptual and they actually occur within a larger including array).
\(mirr\ \vec{i} = \$ - \vec{i}\) mirror-image communication (\(\$\) is N-1 for array size N).
\(br\ \vec{c}\ \vec{i} = \vec{c}\) broadcast from one entry to all.
\(last\ \vec{c}\ \vec{i} = \$ - \vec{c}\) broadcast from an entry with backwards index.
\(null = nil = id\ \ \ \ \) Use this in a tuple of generators to capture the corresponding tuple of indexes.
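
Sketched as ordinary Python functions over a normalized array of size N (my own parameterization; the thesis writes them over index vectors, and \(\$\) stands for N-1):

    corr = lambda i, N: i                        # correspondent: same index in the other (sub)array
    mirr = lambda i, N: (N - 1) - i              # mirror image: i pairs with $ - i
    br   = lambda c: (lambda i, N: c)            # broadcast: every entry reads from index c
    last = lambda c: (lambda i, N: (N - 1) - c)  # broadcast from a backwards-counted index

    # e.g. for N == 8:  mirr(0, 8) == 7,  mirr(3, 8) == 4,  br(0)(5, 8) == 0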

If you're up for a bit of homework, work through the examples on p59 so you understand them.

If you're still struggling maybe my additional notes on Communication in Mou's thesis will help.


The six functions are mostly trivial or one selection from a small set of canonical functions. Their combination is to be understood as follows.

The ":" composition operator operates right to left: \((f:g)(x) = f(g(x))\). So the first function in the above chain is the divide function.

Having divided a bunch of stuff typically in half, we then apply to each half the next function in the chain: preadjust. Preadjust is going to be a "communication function" which specifies, and then brings in, what other data is needed to do this computation.

The local computation which does the work will need all its relevant data to be local, so we must capture everything it needs and bring it in. In particular there will be a tuple containing everything the local operation requires. Further, there will be an array of such tuples, because to do a parallel process means to do the identical thing in a bunch of parallel processors, and to do that we need to provide the same kind of data, namely a tuple of all the required data elements, to each of the parallel processors; hence an array of those tuples. Then we can send one tuple to each processor, and each can do its work separately from all the others, which means it can be done in parallel, which is the point.

Whereas if you need to separately think about what you need in the midst of the working of each parallel process, and then copy over what it requires from wherever it is, well, now you're talking spaghetti code, and we don't tolerate such poor hygiene here.

Although this was also discussed here, the following may be helpful.

divide and combine are structure manipulation functions: they take an array, the input data, and restructure it. \(d_{eo}\) divides a single input array into two, taking evens to the left and odds to the right; while \(c_{eo}\) combines two input arrays into one, shuffling them into even and odd positions in the output. This has nothing to do with going elsewhere to get any data (communication), or doing anything to the data (local calculations): it is orthogonal. But once you've specified your divide and combine functions, you have a whole structure up and down the parallel computation graph which makes life simpler for the other tasks. Similarly, communication and local calculation are orthogonally specifiable, and better so.
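
For instance, a minimal Python sketch of that even/odd pair (my rendering; note it touches positions only, never values), which also checks that the combine exactly undoes the divide:

    def d_eo(A):
        """Divide even/odd: entries at even positions to the left, odd positions to the right."""
        return [A[0::2], A[1::2]]

    def c_eo(parts):
        """Combine even/odd: shuffle the two halves back into even and odd positions."""
        left, right = parts
        out = [None] * (len(left) + len(right))
        out[0::2], out[1::2] = left, right
        return out

    assert c_eo(d_eo([0, 1, 2, 3, 4, 5, 6, 7])) == [0, 1, 2, 3, 4, 5, 6, 7]   # c : d = id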

Communication resides in two forms: specifying what to get, and getting it. But the more basic idea is to think about this in the context of parallel programming. In a parallel program, there is going to be some data which every processor will be separately processing, because to be parallel we want them to be working independently. This means we would like that data to be entirely local to each process, so that it can just do its local calculations without interrupting itself to go fetch data from some other place. This means all the data it needs must be provided in a data structure fed to it, separately. This means we want an array of tuples, one for each subprocess, each containing all the data required by that instance of the subprocess. Then each one can proceed on its own, which is what makes it a parallel program.

Secondly, consider the interrelationships among the data of a parallel process. Is it not always and necessarily a bunch of data? If it's a bunch of data can we not always go through it and make an array of it all? Of course, yes. Having made an array of it, and by assumption it's all the data our parallel process will be using, isn't there some potential dependency between the different elements of the data? You may need, for processing this part of the array, some data from some corresponding other part of the array? That's the idea here.

Let's put it all together:

pdc = combine : postadjust : ! pdc : preadjust : divide

Reading backwards (that is, in compositional order from right to left), the first function, divide, takes the input, which is an array of data, and divides it, typically in half (but dividing into thirds or N parts is okay too).

Now it is tuple-building time, so that we can run the process separately on its collected required data elements, stuffed into one tuple for each process. The preadjust function builds the tuple: first by specifying which corresponding element (or more than one) needs to be brought in here for the local computation, that is, by an index generating function; second, by copying the indexed other data from the other sub-array into this element's tuple, by a communication function.

So the index generating function takes the normalized index within our subarray (the index of "this" tuple), and somehow calculates the index or indexes of the desired other data, in a somehow-corresponding place in the other. For example, the elements at index \(i\) in both subarrays share the same index \(i\), so it's not too complicated to figure out from one what the other is. (That's called corr, for "correspondent communication".) Another, mirr for "mirror", pairs up index \(i\) in this sub-array with index \(\$ - i\) in the other (recall \(\$\) is the last index), and vice versa; in both cases the "other" is known for any element given its own index, without asking anywhere else for outside data.

Then the communication function builds our tuple. It takes the local array element as the first element in the tuple, and it fetches from the other array the element whose index was generated by the index generating function. So both right and left sub-arrays get manipulated into new structures: now they become arrays of tuples.

Does it seem as if we could wait? No! We want all the communication to be handled at once, and separately from the local calculations. We create nice tidy separate workspaces with everything needed in a nice tuple of operating supplies, so then the local operation task can be clean and simple.

At this point we have specified divide, combine, and communicate. Divide and combine are functions given separately to PDC(), while communicate might be just a compositional part of the preadjust function (the first-applied part), if it is followed by a local computation function to be carried out in the building up of the computation graph.

The postadjust function might also include communication as well as local operation.

See, parallel processing builds up a computation graph which is a tree. The tree at the top is a single node that includes the whole giant problem as one bite, but at the leaves of the tree the problem is usually too trivial to even do anything. You can often swallow without even chewing once!

You might think (I confuse myself by thinking) the tree does calculations at each node, thereby doing a recursive breakdown or buildup of results from smaller parts of the tree, but actually those calculations are not so much done at the node as during the building out of the tree, or during the summarizing or winding back up of the tree. The calculations (except at the very ends, the leaves of the tree) are always and only going to be done on the way up or on the way down. Once again, there will be no separate traversal of the tree to do operations at each node outside the preadjust or postadjust processes, which themselves build out the branches from the nodes, or collapse the branches into the nodes, respectively.

Finally, how do we go about understanding and defining our local operation? Let's take those separately.

Understand the local operation by recognizing what it depends on: which values are already local, and which must be communicated in so that the calculation can be purely local.

That should help you.

Next, defining your local operation. This will be the nub of the idea of the parallel decomposition. If you stare at your equation, you'll find a repeating pattern which allows the problem to be broken down into multiple smaller instances of the same problem. How those instances' results get worked up so that they can be combined into the larger solution: that is the work of the local operation.

Actually, you really don't have to solve the problem, at all, just see that it contains a simpler or smaller version of itself.

In FFT it's: shift (multiply by a constant) the result from the odds and add it to the result from the evens. In Reverse it's: delete self, accept other (referring to the 2 elements of a communicated 2-tuple).
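
To make "delete self, accept other" concrete, here is my own Python sketch of a Divacon-style Reverse (treat it as a worked exercise to check against the thesis, not as Mou's code): correspondent communication pairs each entry with its counterpart in the other half, the local operation keeps only other, and the recursion and concatenation do the rest. It assumes the length is a power of two, as balanced divisions typically do.

    def reverse_pdc(A):
        """reverse = c_lr : !reverse : preadjust : d_lr, with basepred = atom? and basefunc = id."""
        if len(A) <= 1:                          # atom? -> id: nothing to do at a leaf
            return A
        mid = len(A) // 2
        left, right = A[:mid], A[mid:]           # d_lr: balanced structure manipulation

        # preadjust = local operation : communication
        left_pairs  = [(left[i], right[i]) for i in range(mid)]    # corr: pair with the other half
        right_pairs = [(right[i], left[i]) for i in range(mid)]
        other = lambda pair: pair[1]
        left, right = [other(p) for p in left_pairs], [other(p) for p in right_pairs]  # accept other

        return reverse_pdc(left) + reverse_pdc(right)              # !reverse, then c_lr (concatenate)

    # reverse_pdc([1, 2, 3, 4, 5, 6, 7, 8]) == [8, 7, 6, 5, 4, 3, 2, 1]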

I forgot to mention, anything that is needed that is NOT part of the (variable-content) array can be specified as a constant in the local operation function itself. Everything else must come from the input array.

basepred and basefunc are discussed here. So that's everything: structure manipulation in the divide and combine functions; orthogonal to communication, which is the first-applying part of the preadjust or postadjust functions; orthogonal to local operation, which is the second-applying part of the preadjust or postadjust functions.

Bibliography

John Backus, "Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs," the 1977 ACM Turing Award Lecture, Communications of the ACM, vol. 21, no. 8, 1978.

H. BURKHARDT,"Trigonometrische Interpolation," in Chapter 9, Encyklopiidie der Mathematischen Wissenschaften, vol. 2, part 1, 1st half, 1899-1916.

H. BURKHARDT,"Trigonometrische Reihenund Integrale(bis etwa 1850)," in Chap- ter 12, Encyklopiidie der Mathematischen Wissenschaften,vol. 2, part 1, 2nd half, 1904-1916.

J. W. COOLEY & J. W. TUKEY, "An Algorithm for the Machine Calculation of Complex Fourier Series," Math. Comput., Vol. 19, no. 2, pp. 297-301, Reprinted in Digital SignalProcessing,ed.L. R.RABINER& C. M. RADER,pp.223-227,New York: IEEE Press,1972,Apr. 1965.

C. F. Gauss, "Nachlass, Theoria Interpolationis Methodo Nova Tractata," in Carl Friedrich Gauss Werke, Band 3, Koniglichen Gesellschaft der Wissenschaften: Gottingen, pp. 265-330, 1866.

Michael T. Heideman, Don H. Johnson, and C. Sidney Burrus, "Gauss and the History of the Fast Fourier Transform," 1985. Rice University, Engineering Dept.

Orian Leitersdorf, Yahav Boneh, Gonen Gazit, Ronny Ronen, and Shahar Kvatinsky, "FourierPIM: High-Throughput In-Memory Fast Fourier Transform and Polynomial Multiplication," Viterbi Faculty of Electrical and Computer Engineering, Technion - Israel Institute of Technology, Haifa, 3200003, Israel. arXiv:2304.02336v1 [cs.AR], 5 Apr 2023.

Alan Perlis, "The Synthesis of Algorithmic Systems," the 1st ACM Turing Award Lecture, Journal of the ACM, vol. 14, no. 1, 1967.

Zhijing George Mou, A Formal Model for Divide-and-Conquer and Its Parallel Realization, PhD thesis, Yale University, May 1990. Research Report YALEU/DCS/RR-795.

Z. George Mou and Xiaojing Wang, "Optimal Mappings of m Dimensional FFT Communication to k Dimensional Mesh for Arbitrary m and k," pp. 104-119 in PARLE '93: Parallel Architectures and Languages Europe, Lecture Notes in Computer Science #694, Springer-Verlag; 5th International PARLE Conference, Munich, Germany, June 1993, Proceedings.
Copyright © 2024 Thomas C. Veatch. All rights reserved.
Created: June 13, 2024; Modified June 17, 2024