TL;DR
Divacon Parallel Algorithm Design:
Your problem is to write a function that does something with an array.
(Is it even possible for parallel computation to be anything else?)
To solve any computing problem, you will necessarily make use of three
orthogonal constituents, none of which includes anything of the
others.
- Local: local calculations (below: \(f_b, preop, postop\)).
- Co: Communication which specifies what elements talk with what other elements (below: \(precomm, postcomm\)).
- SM: Structure Manipulation (below: \(d,c\) which divide and combine arrays of data).
To understand your problem is to break your task into these
constituents. To build the best parallel algorithm, hang those parts
where they belong on the general, recursive parallel computation
scaffold, which is the Divacon equation:
$$\begin{align}
f\ A\ =\ & base?\ f_b\ \\
& else\ \ \ \ c\ :\ post\ :\ !f\ :\ pre\ :\ d\ \ A. \\
\end{align}$$
That defines what \(f\) does. And this is how to create \(f\):
$$\begin{align}
f\ =\ & PDC(d,c,pre,post,base?,f_b) \\
\end{align}$$
Where:
- \(f\) is your function, \(A\) is the data array it operates upon.
- ":" means function composition. \(f : g\ x = f(g(x))\); we don't like unnecessary parentheses.
- \(f\) contains \(f\), means this is a recursive definition.
- \(!f\) means apply \(f\) separately to any number of inputs, that is, if you want, in parallel.
- \(f =\ ..!f..\) produces a tree structured computation
graph: each (parent) instance of \(f\) has some number of daughter
instances \(!f\).
Here \(f\) is more an organizing principle of a control pattern
than what you might normally think of as a meaningful
self-contained function. All the work in \(f\) is done by the
parts of \(f\), always outside the included \(f\)'s, in what
we call \(c, d, pre, post\) parts -- except for the tiniest trivial bit (\(f_b\)) at
the base of the division tree. The point of \(f\) is merely that
it is recursive and parallel, and otherwise it does nothing but
orchestrate the sequence, the subdivision, the mutual dependencies of
its parts.
- \(c\) combines once for each time \(d\) divides. After \(d\) and
before \(c\) we have data parallelism: each subdivision of the
data can be operated on separately.
- \(f_b\), the base function, converts a data value from input type
to output type once division is down to a single data value.
- \(pre = preop : precomm\ \ \ \) and \(\ \ \ post = postop : postcomm\ \ \ \ \ \)
Here, communication feeds local operations in both cases.
- \(preop, postop\) prepare for or recover from the recursive calls
\(!f\). Connecting \(d\) to \(f\), or \(f\) to \(c\), these
enable separated sub-arrays to be separately fed to instances of \(f\), or enable sets
of \(f\) outputs to be combined so as to have the form and
meaning of the function's output. These do local calculations only, isolated from any
communication or division or combination. Each converts data of
the input or output type from one level to another but retains the
same general data type on the input or output sides, respectively.
- \(precomm, postcomm\) do communication only, to bring in any
additional data needed so that local calculations can be local.
Working right to left in the formula, you recursively break down the
inputs: take the array and apply division until you can't
divide any more. After that, apply combination until you can't
combine any more. On this general scaffold, you need three things to
happen. At the limit of division, you need a function to convert from
input data type to output data type (it might just be a conceptual
change of units not always a numerical change of values). On the way
up or down or both, you may need a local calculation to be done at
each level to prepare the inputs for separate application of \(f\), or
to postprocess the outcomes so their combination will be a valid
function output, within each level. Those local calculations will
depend on local values, certainly, but also perhaps on values that
need to be communicated from elsewhere in this or another array.
Those dependencies determine what data is needed locally to do the
calculation, and what needs to be communicated from elsewhere.
Once you get clarity on the Local-Co-SM breakdown of your problem, it
is easy to express it in Divacon, and you now have a language
and methodology for parallel algorithm design. You will find it
enables optimal and concise results.
Okay that was your quick review. If you have a spark burning, now,
read on.
Introduction
These ideas, approach, formalization, notation, and system, come to me
from the great George Mou, whom I am lucky and proud to know as a
friend.
George's work in the math of algorithm parallelization using his
elegant and powerful Divide and Conquer formalization, Divacon,
is both revolutionary and generally unknown. For example, today's
version of the Wikipedia entry for Divide-And-Conquer parallelization
fails to even mention George's work. The Divacon approach
enables miraculously clear thinking.
Let's not be like that. Properly educated, we could build unbelievably simple, provably
correct, highly optimized parallelizations of important as well as new
algorithms. So let's get educated!
Forgive me for a bit of a rant. Because this is the exact kind of BS
that has happened repeatedly in the past in the very subject I'm going
to work on as my homework with you here, the FFT. Did you know that it was
actually Gauss who invented the Fourier Transform before Fourier, Gauss who
further invented the discrete FT, and Gauss who invented the Fast FT just as
we use it today, in 1805, all at once, while Fourier was doing the first
bit in 1807? But no one paid attention to it until Cooley and Tukey
reinvented it in 1965. Even after that it was further discovered that
not just one or two or several but many other reinventions of the same
ideas had been forgotten -- including a review of all the
forgotten reinventions, saying how terrible it was that they
were forgotten -- which itself was forgotten for 60 more
years! Here's a review of that reviewer:
"A second reference to Gauss' algorithm ... was written by
H. Burkhardt in 1904. Burkhardt comments that Gauss' method was
general, but seemingly not known by practitioners. It is interesting
to note that Goldstine's and Burkhardt's works went almost as
unnoticed as Gauss' work itself." (Heideman, Johnson, & Burrus, 1985)
It may be without irony that George tells me he thinks the world may
be ready for this in 200 years. No, it would be typical. (Am I
sensitive?)
When I see this it kills me. It is on us to capture and use
these achievements, to go through these doorways of genius. If we
forget it like Gauss's
notebooks or Danielson and Lanczos (1942) or the many others, then
it is us, we, the dummies who should regret it, and after folks
rediscover it later, we will be ashamed. Our age will be thought of
with condescension, like those Aztecs or Incas who had wheeled toys
but never built wheeled vehicles: so close and
yet so far. Worse in our case, because our new powerful ideas have
actually been built and used, but we still ignored them.
So let's actually not!
Let's push it forward, and bring it into our
understanding and our practice.
In service of this, my small contribution here: motivation,
explanation, tutorial of how to think about parallel problems, a
little more verbose redundancy to make it easier to read than George's
terse mathematics. I hope this brings George the credit he
deserves, sooner than 200 years hence.
Inspiration
George Mou's 1990 Yale Computer
Science PhD thesis was written in the Golden Age of parallel programming
which was going on at Yale in the late 1980's and early 1990's.
In that golden age
Alan
Perlis was alive, saying things like
- "algorithms often fail to exhibit a clarity of structure which
they really possess."
- "While I have spoken glibly about variability attached to the
identifiers of the language, I have said nothing about the
variability of control. We do not really have a way of
describing control, so we cannot declare its regimes. We should
expect our successor to have the kinds of control that ALGOL has
-- and more. Parallel operation is one kind of control about
which much study is being done."(p 8)
- "A natural technique to employ in the use of definitions is to
start with a language X, consider the definitions as enlarging
the syntax to that of a language X' and give the evaluation rules
as a reduction process which reduces any text in X' to an
equivalent one in X."(p 9)
- "Programmers should never be satisfied with languages which
permit them to program everything, but to program nothing of
interest easily."(p9)
Also remember how Backus
admonished us about the limits of von Neumann architecture
(1-word-at-a-time communication, implied especially by the assignment
operator), "Conventional programming languages .. Inherent defects
.. inability to effectively use powerful combining forms for building
new programs from existing ones, and their lack of useful mathematical
properties for reasoning about programs. .. [versus] an algebra of
programs whose variables range over programs and whose operations are
combining forms. ... [and for example]
A Functional Program for Inner Product
Def Innerproduct = (Insert +) o (ApplyToAll *) o Transpose
[where] Composition(o), Insert, and ApplyToAll (a) are functional
forms that combine existing functions to form new ones...
"(ACM 1977 Turing Lecture)
And we can't forget the support of a highly distinguished dissertation
adviser and committee, Paul Hudak, David Keyes, and Bill Gropp.
(Just google those names to track the progress of parallel processing over
3 decades.)
Drum roll, please...
Alan Perlis said to George Mou, "Divacon is the golden key to the future of computer science".
Evidently Divacon parallelization is powerful. Its elements are
simple, and its way of thinking is simple. Let's empower ourselves by learning and
practicing it.
A Language For Parallelism
This conceptual approach, language, notation are from George Mou's
1990 Yale Computer Science PhD thesis.
We will try to get grounded in George's "DC" or
Divide-and-Conquer or "Divacon" ideas and formalism, which give a
mathematical form and elegance to parallel decomposition of
algorithms.
Incidentally, Divacon notation has been made executable several
times, not just with the Connection Machine of the 1980's and 90's, so
let us recognize it is a grounded reality, not just a powerful tool of
hyperabstract thought. After this DC language training, I'll apply
this approach to the Fourier transform, as an exercise.
Before we begin, let me give ourselves all an attitude adjustment.
What we are embarking on here is the understanding of how to break
down and solve big problems, like processing an array of \(10^{10}\)
elements. So all the definitions and concepts
and operations we are going to go over, you should think of them as
being not just simple and definite and understandable, but also giving
you the power and the leverage and the understanding at scale, for
problems at any scale. So yes, please, first accept these ideas
simply into your mind; they are stated simply enough. But also
recognize and feel their generality, how they might apply to big
things as well as small. If you do recognize that, then you may feel
the power that is thrumming under your fingertips as you bring
definitions, forms, and components into your mental models and the
working spaces of your mind. Their simplicity means that you can put a
big harness over a big problem and control it fully by simple
reasoning. Okay? Okay.
To begin, we will be needing some math-like notation and terms.
The conventions below are taken from George's thesis.
- (): Skip parentheses in function application. Write f(x) more cleanly as f x.
- (..) means construct an object whether tuple or vector or recursive array or etc. from the ..-enumerated elements.
- A tuple is a grouping data structure comprising subitems which may not be of identical structure.
- A vector is a tuple constrained to comprise subitems of
identical structure, and to be indexed by a (binary) integer, or
by an index set which pulls out any number of its elements.
- A recursive array is a smart vector: one that knows its length and can be divided (until atom? returns True) and combined.
- Normalized indexes count from 0 to N-1. This lets you think clearly by not having to think about indexes, a great source of noise, confusion, and spaghetti code.
- Relative indexes are normalized by some 1-to-1 mapping then inverse-normalized by the inverse mapping.
So a (sub)vector might begin its life unnormalized, then become normalized, then be
operated upon while normalized, which makes the algorithm simple, then
be inverse-normalized back to its original indexing. Separating the
calculations from the indexing struggles makes things nice and simple.
George likes to distinguish between communications versus operations.
Getting the data you need, and doing something with it, should be
separate concerns. Indeed, each has nothing intrinsically to do with
the other. So we do relative indexing so as to forget about where the
data came from and what indexes it may have had in that location, so
that our operations can be simple and pure and local.
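If it helps to see the indexing idea as code, here is a tiny sketch of my own (plain Python lists, and only the simple contiguous case; an even-odd division would use a different 1-to-1 mapping, such as \(i \mapsto 2i + parity\)):

```python
def normalize(parent_indexes):
    """Map a subvector's parent-array indexes to 0..N-1, remembering the offset."""
    offset = parent_indexes[0]
    return [i - offset for i in parent_indexes], offset

def denormalize(local_indexes, offset):
    """Inverse mapping: recover the original parent-array indexes."""
    return [i + offset for i in local_indexes]

local, offset = normalize([4, 5, 6, 7])            # a subvector occupying parent slots 4..7
assert local == [0, 1, 2, 3]                       # normalized: think only in 0..N-1
assert denormalize(local, offset) == [4, 5, 6, 7]  # inverse-normalize when combining back
```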
Parallel Forms
Next, some combining forms:
- :: function composition. f : g means first apply g, then apply f to the output of g.
- !: distributive application. \(!\ f\ (x_0,..x_{n-1})\ =\
(f\ x_0,..\ f\ x_{n-1})\) or rarely \(!\ (f_0,\ ...\ f_{n-1})\
(x_0,..x_{n-1})\ =\ (f_0\ x_0,..\ f_{n-1}\ x_{n-1})\) if there
are different f's.
\(!\ f\ X\) is the essence of parallel decomposition.
The symbol \(!\) here says apply f to each of the separate
elements of X. It doesn't need them to be done in any particular
order, and in particular they can be done separately and all at
once, that is to say by parallel processors, if you have them
available to do the work, or not, if not. (We are doing math
here; implementation is a separate concern and typically there is
a forest of implementation possibilities which initially would
only distract our minds from the simple essences. So do this as
math.)
If you can take apart your algorithm so that it contains itself
not just recursively but inside a distribution, as in \(f\ =\
...\ !\ f\ ...\) then you have a parallel program.
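If you want to see these two forms as executable code, here is a minimal Python rendering; the names `compose` and `distribute` are mine, standing in for \(:\) and \(!\):

```python
from functools import reduce

def compose(*fs):
    """f : g : h  --  apply right to left, as in f(g(h(x)))."""
    return lambda x: reduce(lambda acc, fn: fn(acc), reversed(fs), x)

def distribute(f):
    """! f  --  apply f independently to every element; no ordering is implied,
    so a parallel implementation is free to run all the applications at once."""
    return lambda xs: [f(x) for x in xs]

square = lambda x: x * x
assert distribute(square)([1, 2, 3, 4]) == [1, 4, 9, 16]   # ! square

inc, double = (lambda x: x + 1), (lambda x: 2 * x)
assert compose(inc, double)(5) == 11                       # (inc : double) 5 = inc(double(5))
```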
And some (standard abstract) math won't hurt us:
- A relation (an operator), written below as \(r\) or \(R\), for example \(+\) or \(*\), takes some input(s) and yields some output(s).
- A morphism is a function which can be done in either of two domains. So
f is a morphism if \(f ( x_1 \) r \( x_2 ) = (f x_1) \) R \( (f x_2) \).
That is, operate on your inputs in the r domain, then apply f to the output, or do f on the inputs, and operate in the R domain;
either way you get the same thing.
Here's one. Do you have your logarithm concepts and rules in
memory? If not, go look them up again, they are cool. If I say this:
\(\ log\ :\ *\ =\ +\ :\ !\ log\)
Then do you recognize the ":" of function composition? Do I need to
write the vector of inputs (2 will do) to which a function
applies, as follows?
log(product(\(x_1, x_2\))) = sum(log(\(x_1\)), log(\(x_2\)))?
I'd like to remove the messy
and confusing parentheses: log : prod X = sum : ! log X,
and let the function apply to all the members of the vector.
So that's a bit of notation.
But the meaning is there too. Remember from calculus that the
log of a product is the sum of the logs. Right? For example,
\(log(a*b*c)=log(a)+log(b)+log(c)\). Morphisms are helpful if it
is easier to go do something in a different domain and then come
back to the same result.
This is getting close to \(f\ A\ =\ c\ :\ !\ f\ :\ d\ A\) which
would say that \(f\) on arrays is (can safely be defined as) a
morphism of itself on divided sub-arrays that are then combined.
That would be a mathematical proof for a parallel decomposition
of a function.
Finally the simplest of all things, \(id\):
- id means the identity function: its output is identical to its input.
Was it simple enough? Maybe you're stuck on morphisms, but with ! and
:, can you see how \(f\ =\ c\ :\ !\ f\ :\ d\) would be a parallel decomposition?
That's all you really need.
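As a toy rendering of that idea in Python (my example): log of a product computed as a genuine \(c\ :\ !f\ :\ d\), where \(d\) splits the array in half, \(!f\) recurses on the halves independently, and \(c\) is just \(+\):

```python
import math

def log_of_product(xs):
    """log(prod xs) computed as  c : !f : d,  with c = sum, d = split-in-half."""
    if len(xs) == 1:                                  # base case: atom
        return math.log(xs[0])                        # base function: cross into the log domain
    halves = [xs[:len(xs)//2], xs[len(xs)//2:]]       # d: divide (left-right)
    partials = [log_of_product(h) for h in halves]    # !f: independent recursive calls
    return sum(partials)                              # c: combine in the log domain

data = [2.0, 4.0, 8.0, 16.0]
assert abs(log_of_product(data) - math.log(2 * 4 * 8 * 16)) < 1e-12
```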
Decomposition: How to Think Clearly About Computing
George's favorite lesson: Don't mix up how you will do something with
what you are going to do, Tom!
But his primary teaching point in PDC is
this, that computer algorithms are built from three orthogonal
components: operation or calculation, communication, and structure
manipulation. These are separate; they should be kept separate; and
since they are orthogonal they can be specified separately.
Further, mixing them produces confusion. Further, if you merely
specify these three clearly, you've already done the job of parallel
programming. More than just describing some qualities of the program,
you will have already written it. Let me say it a few times.
Understanding computing deeply, which means simply, means recognizing
the orthogonality of communication, structure manipulation, and local
operations. When you communicate data, you aren't manipulating
structure, and you're not doing local calculations. When you create
and modify structures, you aren't communicating, and you aren't
calculating. And when you are doing local calculations, you aren't
communicating or modifying structures. If you can be clear about
this, things fall into place very easily. Sadly, if not, not -- not
in the world of parallel algorithm design.
On the other hand, the mere decomposition actually specifies, actually
writes, a Divacon algorithm and a Divacon program. So let's try to be
extremely clear about what these orthogonal dimensions are, and let's
try to develop a fastidious intuition when we are specifying one of
them about not mixing it up with either of the others. If we can do
that, it will turn out that we can eat elephants for breakfast.
- At the bottom of the analysis tree, and in various
algorithms on the way up or down or both also, one actually
does some calculations with the data that has been placed on
your plate to make new values. That's operation.
- Before the operation, the patient must be on the table.
Deciding what parts to get, that's communication.
- During a war, the soldiers must be arranged into groups of different sizes
by dividing up or combining them together. That is structure
manipulation.
George seems to say, if you will just get your mind clear on what is being done in these
three orthogonal dimensions, your work is basically done for you.
Let's go through them twice more, and see if we can be clear minded
about them then.
- Local operations a.k.a. "universal functions" (p 53)
"Computations over arrays also need operations of an
orthogonal nature, that is, operations that map array entries
to new values."
- Communication can be understood as having two
phases. The second phase is to go get that needed data and
put a copy of it here, in or adjacent to the thing that
needs it. But getting what's needed depends on the key
first phase which picks out the indexes of what will be
needed in the local operations.
In the expression \(\#\ !\ f\) this will be the \(f\) (for
"find", not for "function", and also it's not "get", \(f\)
just says where what you want is to be found). Then \(\#\)
says to actually get it and make a 2-tuple of it together with the
element that asked for it. And "!" says do the same
(opposite) communication on both sides. See below for
details and examples.
- Structure manipulation (a.k.a. relational functions,
p53) exists in a couple of forms: tuple creation as part of
communication, and the divide and combine methods.
First, given a communication generator such as \(f\)
(we'll be using corr, mirr, br, mainly), the
functional forms "\(\#\ f\ x\)" (and "\(\#\ !\ f\ X\)")
construct a new 2-tuple for each element \(x\) (or element of
\(X\)), to contain the elements picked out by \(f\).
Like, if your system wants to add this and that, it'll apply
\(+\) to the tuple (this,that), which was created by \(\#\
f\ index\_of\_this\). Like this:
\(\ +\ :\ (\#\ !\ f)\ \vec{i}_{these}\)
Are you following? Summation applies to the pairs whose
first element is one of \(X_{i}\) and whose second
element is whatever \(f\) aims you at, when presented with one of
\(\vec{i}_{these}\), such as the corresponding element in another
sub-array (if f = corr), or the opposite/mirroring element
(mirr), or all the elements (br) there.
Second, the divide and combine methods \(c : d = d^{-1} : d
= id \) take arrays and return divided or combined arrays,
without modifying their contents. Thus no local operations
are part of \(d\) or \(c\), since they don't change any
array values. They do change the indexing schemes
for the divided or combined arrays, so as to count in order
from \(0\) to \(N-1\) where the length of the output array
is \(N\).
It is preferred (for reasoning to be made either simpler, or possible) for \(d\) and
\(c\) to be independent of the data. The typical ones are
\(d_{eo}, d_{lr}\) with \(d_{eo}^{-1}=c_{eo}\), and
\(d_{lr}^{-1}=c_{lr}\) for Even-Odd and Left-Right divide and combine,
but you may also encounter
\(d_{ht}\) with \(d_{ht}^{-1}=c_{ht}\) being unbalanced head-tail
division and combination.
It's a lot. But I've been through it a few times and it is starting
to make sense, so I encourage you.
Now, once the three orthogonal dimensions of computing are clarified,
writing a Divacon program basically reduces to checking 6 boxes on a
checklist. All you have to do is define the divide and combine
functions which manipulate (divide and combine) structures, the
communication function which specifies what other data from where will
be needed to do a computation, and the local computation which can
finally be carried out when the data is present. Actually these are
easy to separately define, and systematically used in the Divacon
structure, which makes the work so easy and quick that George tells
me:
Tom, if it takes longer than one minute, then your answer is
wrong.   --George Mou
Here's the checklist. Divacon defines a higher-order-function
generator, PDC(), which accepts six functions, and returns a parallel
program implementing the algorithm embodied in those functions.
Define and apply a PDC as follows:
\(pdc\ =\ PDC(\)divide, combine, preadjust, postadjust, basepred, basefunc\()\)
\(Y\ =\ pdc\ X\)
A PDC so defined is equivalent to the following
expression, which constitutes a general if not exhaustive theory
of parallel algorithm decomposition.
Divacon:
$$\begin{align}
pdc\ =\ (\ & basepred?\ \ basefunc \\
& else\ \ combine\ :\ postadjust\ :\ !\ pdc\ :\ preadjust\ :\ divide\ )
\end{align}$$
I find myself repeatedly studying this, and it is getting more intuitive.
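Here is one way the PDC generator might be rendered as a Python higher-order function. This is my own sketch of the control scaffold, not George's implementation; for simplicity I let preadjust and postadjust take and return the whole list of sub-arrays or sub-results:

```python
def PDC(divide, combine, preadjust, postadjust, basepred, basefunc):
    """pdc A = basefunc A                                            if basepred A
             = (combine : postadjust : !pdc : preadjust : divide) A  otherwise"""
    def pdc(A):
        if basepred(A):
            return basefunc(A)
        subarrays = preadjust(divide(A))            # d, then pre
        results = [pdc(sub) for sub in subarrays]   # !pdc -- each call is independent
        return combine(postadjust(results))         # post, then c
    return pdc

# The most trivial instantiation: summing an array.
identity = lambda x: x
atom = lambda A: len(A) == 1                        # base predicate
d_lr = lambda A: [A[:len(A)//2], A[len(A)//2:]]     # left-right divide
c_plus = lambda parts: [parts[0][0] + parts[1][0]]  # combine two 1-element results

array_sum = PDC(d_lr, c_plus, identity, identity, atom, identity)
assert array_sum([3, 1, 4, 1, 5, 9, 2, 6]) == [31]
```

Every more interesting algorithm differs from this trivial sum only in the six arguments handed to PDC.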
The adjust functions are in general three phases:
1) communication specification,
followed by 2) communication, followed by 3) local operation. They
preserve the data domain they are operating within, so for
example an FFT on the preadjust and divide side is all
rearranging sampled data real numbers representing the time
domain signal, while the postadjust and combine side is all
operating on frequencies which are complex numbers representing
amplitude and phase for each different frequency.
Also, look suspiciously at an adjust function which does not emit
the same kind of data structures it consumes. It is not a
structure manipulation function!
To create a Divacon algorithm the system creator specifies 6
things (most of which are typically either predefined since
generally useful, or quick work once you think in these terms and
with this kind of vocabulary). The creator then lists the six as
arguments to the PDC higher-order-function (a function that
creates a function), and runs the returned function on some data
X to yield some output Y.
It may help to draw out a computation graph such as on
p79,
and track what happens where so you can see what everything
means. You should see how recursive definition works, how divide and
combine operate on the tree, where basepred applies and what happens then,
and finally how the adjustment functions work and where they apply exactly.
Then if you can figure out communication, you'll be ahead of me.
Now, let's review the 6 with reference to the 3 orthogonal dimensions
of computation.
But first, basepred. This is used to detect the leaves of the
computation graph, at which divide has done its job and no
longer applies. basepred does none of the three
orthogonal tasks of computing, neither creating new values, nor
figuring out what to fetch, nor modifying structure. It triggers
basefunc, is all; and then it would be basefunc
that does a local operation.
Structure modification is carried out within divide and combine.
Only balanced divide and combine operations yield benefits of
parallel programming; \(d_{ht}\) etc. will be as slow as any
sequential program, stepping one by one through the data array or list.
In contrast, effective parallel decomposition gains its power by
splitting giant things into smaller and repeatedly smaller
things, which can all be handled at the same time in parallel.
Then you might still have a million multiply-adds, for example,
but they might take a single time step, given a million
processors. Instead of eating an elephant one bite at a time
(sequentially), a million, little, themselves-edible piranhas
each simultaneously take their separate bites all at once, and
eat the whole thing all at once, except that then their bigger
and bigger cannibal brothers eat two then the two that ate two,
then ... finally, the whole digested elephant inside two giant
beta piranhas, which are eaten themselves in a single monstrous
bite by the alpha piranha of them all. This is the miracle of
parallel processing. (I know, elephants don't live in the Amazon!
It's a metaphor, silly.)
So to get back to divide and combine, all that those should do is
take an array and split it into N sub-arrays, or vice versa; they
do no local operations; also unlike algorithms like
quicksort they don't care about the contents of the
arrays, which makes it much easier, again, to reason about what
they do. If it's nice and clean like that, then you can indeed
say and be satisfied that \(c : d\ =\ id\). Then you can divide
up, and eat up, the whole elephant, in nice tiny amenable tasty
cubes at the lowest level, but in fully confident and no more
difficult or complicated, enormous bites at the top. Because
it's the same divide and combine at every level, and they always
match perfectly with the other. Because \(c : d\ =\ id\).
A combine task, for example, might include the denormalizing of
the second of two vectors upon combining them into one. In
this case, it adds subvector length N to all the indexes of the
vector that is being denormalized, in order to sequentially
combine it with another. Thus vectors indexed with [0 1]- and
[0 1]- could be combined to a [0 1 2 3]-indexed vector.
(Isn't that a case of creating new values, which is a local
operation? Or do we make an exception for the mechanics of this
"relative indexing" system, and just hide those index
adjustments?)
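Mechanically the index adjustment is tiny either way; a hypothetical sketch of that denormalize-and-append step:

```python
def denormalize_append(left_idx, right_idx):
    """Sequentially combine two normalized index sets: shift the second by len(left)."""
    return left_idx + [i + len(left_idx) for i in right_idx]

# Two [0 1]-indexed subvectors combine into one [0 1 2 3]-indexed vector.
assert denormalize_append([0, 1], [0, 1]) == [0, 1, 2, 3]
```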
Local operations can be carried out within pre-adjust, post-adjust, and basefunc.
I once thought I wanted to put a multiply-add inside the combine
operation: Mistake! That confuses matters; so instead the better
approach is to put it into a postadjust function, or a preadjust
in the next layer up. basefunc is often simply id,
and all the actual work is done on the way up or down.
Communication is specified by corr, mirr, br,
etc. index generators, or I might think of them as "finders",
\(f\), and then used in structure modification in the parallel form
\(\#\ !\ f\), which may be part of the adjustment or base functions.
In parallel divide and conquer we need a base predicate to tell
us True when we are finally done dividing the problem into
sub-sub-problems operating at the base level of the division
tree, the leaf level, and a base function to do the operation at
that base level.
"[T]he simplest and most intuitive (base predicate and base
function) are atom? and id respectively."(4.2.3,
p77)
Then the simplest leaf operation would be: if (atom?) then return id.
This is what is specified by PDC(..., atom?, id).
(As a bit of a miraculous example in the case of the FFT,
basefunc=id because the 0th angle step is \(\theta=0\) and
\(X * e^{i0} = X * 1 = X\).)
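In code, those two base pieces are about as small as functions get; a sketch consistent with the PDC skeleton above, plus a check of the FFT remark:

```python
import cmath

atom = lambda A: len(A) == 1      # base predicate: a single element remains
identity = lambda x: x            # base function: hand the atom back unchanged

assert atom([7]) and not atom([7, 8])

# The FFT remark: at the base the twiddle factor is e^{i*0} = 1,
# so multiplying by it really is the identity.
assert cmath.exp(1j * 0) == 1
```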
On Communication
# almost means Communication, a concept I struggled
with. The following notes are copied or summarized from section 3.6.3
p56ff.
Let \(f\) be a finder or communication generator mapping index
sets to index sets (not always one-to-one; for a single index it
might want to bring in more than one).
Finder \(f\) will know enough about where it is to figure out
where other needed stuff also is located, typically the low bits
of the current index, which it can use to refer over to a
corresponding element, for example, in a different section of the
same array, or a corresponding section of an adjacent array but
which shares the same low bits in its index. In that case it
doesn't even have to ask anyone where to look for something, it
already knows because it knows where it is, itself.
So given an indexable array \(A\) we have \(A : \vec{i}\) which is the
indexed subset of \(A\) and we also can have \(A : f\ \vec{i}\) which is the
subset of A that the mapped-to indexes pull out, when the index set \(\vec{i}\)
gets mapped to some other index set \(f\ \vec{i}\). So that's another subset of A.
Then we can define \(\#\ f\) as:
\(\#\ f\ A\ =\ A'\), where \( A'\ \vec{i}\ =\ ( A\ \vec{i},\ A\ :\ f\ \vec{i}) \)
that is, get an extra copy of the \(\vec{i}\)-indexed values of
A, namely \(A\ \vec{i}\), bring them in to a new tuple as its
first element, but then also bring in the \(f\)-modified-index
values from A, put them together into a 2- or N-tuple, and now you have
the other data that you may need later all in one place for the
operation you'll be doing later. In the context of such pairs,
it is convenient to define:
- \(self\ (x,y) = x\)
- \(other\ (x,y) = y\)
George says, "The operator # thus is a higher order function that
transforms a function over index sets to a function over arrays
that performs communication. " (This becomes more clear when you
recognize that corr operates (typically) on a pair of data
arrays, and specifies for each element that the corresponding
element in the other array is just the one in the same position.)
George continues: "A communication function maps each array entry
to a pair, where the first component is the entry's original
value, and the second is the value of another entry whose index
is determined by applying the generator to the entry's own
index." (So corr, mirr etc. operate on pairs of arrays by
working on each of their entries, positionally, and copying the
other array's position-matching element to come in here with this
element in a tuple, which provides what is needed for local
computation. Note that this is not considered to be structure manipulation,
and certainly not local computation: it is communication.)
You'll see \(\#\ !\ f\) a lot, which also uses the distribution symbol \(!\) to say,
apply \(f\) separately to each of the elements of the following array.
So there are several standard communication generators, and you
can read the
thesis
(section 3.6.3) for several more. So \(f\) is commonly:
| Generator | Meaning |
| --- | --- |
| \(corr\ \vec{i} = \vec{i}\) | correspondent communication (the index of the corresponding element in an adjacent array will be the same, or have the same low bits if the two arrays are merely conceptual and actually occur within a larger including array). |
| \(mirr\ \vec{i} = $ - \vec{i}\) | mirror-image communication ($ is N-1 for array size N). |
| \(br\ \vec{c}\ \vec{i} = \vec{c}\) | broadcast from one entry to all. |
| \(last\ \vec{c}\ \vec{i} = $ - \vec{c}\) | broadcast from an entry with backwards index. |
| \(null = nil = id\) | use this in a tuple of generators to capture the corresponding tuple of indexes. |
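Here is a sketch of \(\#\) and a few generators in Python. This is my own single-array rendering (the names `comm`, `corr`, `mirr`, `br` mirror the notation; in the thesis the generators map index sets to index sets, and corr typically reaches across a pair of sub-arrays, which I simulate by treating the two halves of one list as the pair):

```python
def corr(n):
    """Correspondent: within a 2n-long array seen as two halves, index i points
    at the element with the same offset in the other half."""
    return lambda i: (i + n) % (2 * n)

def mirr(n):
    """Mirror: index i points at $ - i, where $ = N-1."""
    return lambda i: (n - 1) - i

def br(c):
    """Broadcast: every index points at the single entry c."""
    return lambda i: c

def comm(finder):
    """The # operator: turn an index-to-index generator into an array-to-array
    communication, pairing each entry with the entry its finder points at."""
    return lambda A: [(A[i], A[finder(i)]) for i in range(len(A))]

A = [10, 20, 30, 40]
assert comm(mirr(4))(A) == [(10, 40), (20, 30), (30, 20), (40, 10)]
assert comm(corr(2))(A) == [(10, 30), (20, 40), (30, 10), (40, 20)]
assert comm(br(0))(A)   == [(10, 10), (20, 10), (30, 10), (40, 10)]
```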
If it's worth your doing a bit of homework, work through the examples
on p59 so you understand them.
If you're still struggling maybe my additional notes on Communication in Mou's thesis will help.
The six functions are mostly trivial or one selection from a small set of canonical
functions. Their combination is to be understood as follows.
The ":" composition operator operates right to left: \((f:g)(x) = f(g(x))\).
So the first function in the above chain is the divide function.
Having divided a bunch of stuff typically in half, we then apply to each half
the next function in the chain: preadjust. Preadjust is going to be a "communication function"
which specifies, and then brings in, what other data is needed to do this computation.
The local computation which does the work will be needing all its
relevant data to be local, so that means we need to capture
everything it needs and bring it in. In particular there will be
a tuple containing everything the local operation requires.
Further, there will be an array of such tuples, because to do a
parallel process means to identically do the same in a bunch of
parallel processors, and to do that we need to provide the same
kinds of data, whatever the process needs, namely some tuple of
all the required data elements, to each of all the parallel
processors, which means we need an array of those tuples.
Then we can send one of the tuples to each of the processors, and
it will be able to do its work separately from all the other
processors, which means it can be done in parallel, which is the point.
Whereas if you need to separately think about what you need in
the midst of the working of each parallel process, and then copy
over what it requires from wherever it is, well now you're
talking spaghetti code, and we don't tolerate such poor hygiene
here.
Although this was also discussed here,
the following may be helpful.
divide, and combine are structure
manipulation functions: they take an array, the input data,
and restructure it. \(d_{eo}\) divides a single input array into
two, taking evens to the left and odds to the right; while
\(c_{eo}\) combines two input arrays into one, shuffling them
into even and odd positions in the output. This has nothing to
do with going elsewhere to get any data (communication), or doing
anything to the data (local calculations): it is orthogonal. But
once you've specified your divide, and combine
functions, you have a whole structure up and down the parallel
computation graph which makes life simpler for the other tasks.
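Concretely, a sketch of the standard divides and combines on plain Python lists (the real recursive-array versions also manage the index normalization, which lists hide):

```python
def d_lr(A):
    """Left-right divide: split into two contiguous halves."""
    n = len(A) // 2
    return [A[:n], A[n:]]

def c_lr(parts):
    """Left-right combine: inverse of d_lr."""
    return parts[0] + parts[1]

def d_eo(A):
    """Even-odd divide: evens to the left, odds to the right."""
    return [A[0::2], A[1::2]]

def c_eo(parts):
    """Even-odd combine: shuffle the halves back into even and odd positions."""
    evens, odds = parts
    out = [None] * (len(evens) + len(odds))
    out[0::2], out[1::2] = evens, odds
    return out

A = [7, 1, 8, 2, 9, 3, 10, 4]
assert c_lr(d_lr(A)) == A                      # c : d = id
assert c_eo(d_eo(A)) == A                      # c : d = id
assert d_eo(A) == [[7, 8, 9, 10], [1, 2, 3, 4]]
```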
Similarly communication and local calculation are orthogonally
specifiable, and better so.
Communication resides in two forms: specifying what to get, and
getting it. But the more basic idea is to think about this in
the context of parallel programming. In a parallel program,
there is going to be some data which every processor will be
separately processing, because to be parallel we want them to be
working independently. This means we would like that data to be
entirely local to each process, so that it can just do its local
calculations without interrupting itself to go fetch data from
some other place. This means all the data it needs must be
provided in a data structure fed to it, separately. This means
we want an array of tuples, one for each subprocess, each to
contain all the data required by any instance of the
subprocess. Then each one can proceed on its own, which is
what makes it a parallel program.
Secondly, consider the interrelationships among the data of a
parallel process. Is it not always and necessarily a bunch of
data? If it's a bunch of data can we not always go through it
and make an array of it all? Of course, yes. Having made an
array of it, and by assumption it's all the data our parallel
process will be using, isn't there some potential dependency
between the different elements of the data? You may need,
for processing this part of the array, some data from some corresponding
other part of the array? That's the idea here.
Let's put it all together:
pdc = combine : postadjust : ! pdc : preadjust : divide
Reading backwards (that is, in compositional order from right to
left), the first, divide function takes the input which is an
array of data, and divides it typically in half (but dividing in 1/3 or 1/N's is
okay too).
Now it is tuple-building time, so that we can run the process
separately on its collected required data elements, stuffed into
one tuple for each process. The preadjust function builds the
tuple, first by specifying what is the one corresponding element
(or more) which needs to be brought in here for the local
computation, that is, by an index generating function. Second, by copying the index'ed other data
from the other sub-array in to this element's tuple, by a
communication function.
So the index generating function takes the normalized index
within our subarray (the index of "this" tuple), and somehow
calculates the index or indexes of the desired other data, in a
somehow-corresponding place in the other. For example, the
\(i\)th elements in both subarrays both have
index \(i\), so it's not too complicated to figure out from one what the other is.
(That's called corr, for "Correspondent communication".)
Another, mirr for "Mirror", pairs up the \(i\)th element in this
sub-array with the (length\(-1-i\))th in the other, and vice
versa; in both cases the "other" is known for any array's \(i\)th
element without asking anywhere else for outside data.
Then the communication function builds our tuple. It takes the
local array element as the first element in the tuple, and it
fetches from the other array the indexed element whose index was
generated by the index generating function. So both right and
left sub-arrays get manipulated into new structures, now they
become arrays of tuples.
Does it seem as if we could wait? No! We want all the
communication to be handled at once, and separately from the
local calculations. We create nice tidy separate workspaces with
everything needed in a nice tuple of operating supplies, so then
the local operation task can be clean and simple.
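For instance (a toy example of mine, not from the thesis), a preadjust whose local operation just needs each element paired with its correspondent across the two halves, and which then sums each pair, composes exactly as described:

```python
corr2 = lambda i: (i + 2) % 4                  # correspondent finder for two halves of length 2
pair_up = lambda A: [(A[i], A[corr2(i)]) for i in range(len(A))]   # the "# ! corr" step
add_pair = lambda t: t[0] + t[1]               # local operation: one tuple in, one value out
preadjust = lambda A: [add_pair(t) for t in pair_up(A)]            # ! add_pair : # ! corr

assert preadjust([1, 2, 3, 4]) == [4, 6, 4, 6]
```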
At this point we have specified divide, combine,
and communicate. Divide and combine are
functions given separately to PDC(), while communicate
might be just a compositional part of the preadjust function (the
first-applied part), if it is followed by a local computation
function to be carried out in the building up of the computation
graph.
The postadjust function might also include communication
as well as local operation.
See, parallel processing builds up a computation graph which is a
tree. The tree at the top is a single node that includes the
whole giant problem as one bite, but at the leaves of the tree the
problem is usually too trivial to even do anything. You can
often swallow without even chewing once!
You might think (I confuse myself by thinking) the tree does
calculations at each node thereby doing a recursive breakdown or
buildup of results from smaller parts of the tree, but actually
those calculations are not so much done at the node, as
during the building out of the tree, or during the
summarizing or winding back up of the tree. The calculations
(except at the very ends, the leaves of the tree) are always and
only going to be done on the way up or on the way down. Once
again, there will be no separate traversal of the tree to do
operations at each node outside the preadjust or postadjust
processes which themselves build out the branches from the
nodes, or collapse the branches into the nodes, respectively.
Finally, how do we go about understanding and defining our
local operation? Let's take those two separately.
Understand the local operation by recognizing
- that it will be a function which takes a tuple (or you
might say list of arguments) as inputs;
- that it cannot access anything outside itself and its inputs,
so you could put constants in it for your math to work out,
or if you need something else then you have to build that into
your original or communication-expanded data before you even get to
the local operation;
- that it will be distributively (separately, simultaneously,
on separate processors) applied to all the elements in an
array (of tuples), one process and one copy of the function
operating on one tuple;
- that its output will be a single element, perhaps a
1-tuple, with the type and structure of an element in the
original (top-level-input) data array on the way from root
to leaf, or with the type and structure of the ultimate
answer in the final (top-level-output) data;
- that all known local operation functions are goddamn simple.
If it takes you more than a minute to write one, you can be sure you're wrong.
As George says, things are either obvious or impossible. So keep
thinking until you see the simple essence of how the function is built out of itself.
That should help you.
Next, defining your local operation. This will be the nub of the
idea of the parallel decomposition. If you stare at your
equation, you'll find a repeating pattern which allows the
problem to be broken down into multiple smaller instances of the
same problem. The way those instances' results are combined to
make the larger solution is the work of the local operation.
Actually, you really don't have to solve the problem, at all,
just see that it contains a simpler or smaller version of
itself.
In FFT it's shift (multiply by a constant) the sum of the odds
and add to the sum of the evens. In Reverse it's delete self
accept other (referring to the 2 elements of a communicated
2-tuple).
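To make the Reverse case concrete, here is a self-contained sketch of mine (restating the PDC skeleton from earlier). Under correspondent communication, "delete self, accept other" means each half adopts the other half's contents before recursing, which is all the preadjust does here:

```python
def PDC(divide, combine, preadjust, postadjust, basepred, basefunc):
    def pdc(A):
        if basepred(A):
            return basefunc(A)
        return combine(postadjust([pdc(s) for s in preadjust(divide(A))]))
    return pdc

identity = lambda x: x
atom = lambda A: len(A) == 1
d_lr = lambda A: [A[:len(A)//2], A[len(A)//2:]]
c_lr = lambda parts: parts[0] + parts[1]

def accept_other(parts):
    """preadjust = ! other : # ! corr -- pair each element with its correspondent
    in the other half, then keep only the other; at half granularity that is a swap."""
    left, right = parts
    return [list(right), list(left)]

reverse = PDC(d_lr, c_lr, accept_other, identity, atom, identity)
assert reverse([1, 2, 3, 4, 5, 6, 7, 8]) == [8, 7, 6, 5, 4, 3, 2, 1]
```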
I forgot to mention, anything that is needed that is NOT part of
the (variable-content) array can be specified as a constant in
the local operation function itself. Everything else must come
from the input array.
basepred and basefunc are discussed here.
So that's everything. Structure manipulation in the divide and combine functions;
orthogonal to communication, which is the first-applying part of the preadjust or postadjust functions;
orthogonal to local operation, which is the second-applying part of the preadjust or postadjust functions.
Bibliography
John
Backus 1977 Turing Lecture.
H. Burkhardt, "Trigonometrische Interpolation," Chapter 9 in Encyklopädie der Mathematischen Wissenschaften, vol. 2, part 1, 1st half, 1899-1916.
H. Burkhardt, "Trigonometrische Reihen und Integrale (bis etwa 1850)," Chapter 12 in Encyklopädie der Mathematischen Wissenschaften, vol. 2, part 1, 2nd half, 1904-1916.
J. W. Cooley & J. W. Tukey, "An Algorithm for the Machine Calculation of Complex Fourier Series," Math. Comput., vol. 19, no. 2, pp. 297-301, Apr. 1965. Reprinted in Digital Signal Processing, ed. L. R. Rabiner & C. M. Rader, pp. 223-227, New York: IEEE Press, 1972.
C. F. Gauss, "Nachlass: Theoria Interpolationis Methodo Nova Tractata," in Carl Friedrich Gauss Werke, Band 3, Königlichen Gesellschaft der Wissenschaften: Göttingen, pp. 265-330, 1866.
Michael T. Heideman, Don H. Johnson, C. Sidney Burrus, "Gauss and the History of the Fast Fourier Transform," 1985. Rice U Eng. Dept.
Orian Leitersdorf, Yahav Boneh, Gonen Gazit, Ronny Ronen, Shahar
Kvatinsky, FourierPIM: High-Throughput In-Memory Fast Fourier
Transform and Polynomial Multiplication Viterbi Faculty of Electrical
and Computer Engineering, Technion – Israel Institute of Technology,
Haifa, 3200003, Israel. ArXiv 2304.02336v1 [cs.AR]
5 Apr 2023.
Alan Perlis, The Synthesis of Algorithmic Systems, The 1st Turing
Lecture. (ACM V 14 No 1, 1967).
Zhijing George Mou, A Formal Model for Divide-and-Conquer and Its
Parallel Realization, May 1990, Research Report YALEU/DCS/RR-795.
Z. George Mou, and Xiaojing Wang, Optimal Mappings of m Dimensional
FFT Communication to k Dimensional Mesh for Arbitrary m and k. pp
104-119. in PARLE '93, Parallel Architectures and Languages Europe,
Lecture Notes in Computer Science #694, Springer-Verlag. 5th
International PARLE Conference, Munich, Germany, June 1993
Proceedings.