TL;DR
Divacon Parallel Algorithm Design:
Your problem is to write a function that does something with an array.
(Is it even possible for parallel computation to be anything else?)
To solve any computing problem, you will necessarily make use of three
orthogonal constituents, none of which includes anything of the
others.
- Local: local calculations (below: \(f_b, preop, postop\)).
- Co: Communication which specifies what elements talk with what other elements (below: \(precomm, postcomm\)).
- SM: Structure Manipulation (below: \(d,c\) which divide and combine arrays of data).
To understand your problem is to break your task into these
constituents. To build the best parallel algorithm, hang those parts
where they belong on the general, recursive parallel computation
scaffold, which is the Divacon equation:
$$\begin{align}
f\ A\ =\ & base?\ f_b\ \\
& else\ \ \ \ c\ :\ post\ :\ !f\ :\ pre\ :\ d\ \ A. \\
\end{align}$$
That defines what \(f\) does. And this is how to create \(f\):
$$\begin{align}
f\ =\ & PDC(d,c,pre,post,base?,f_b) \\
\end{align}$$
Where:
- \(f\) is your function, \(A\) is the data array it operates upon.
- ":" means function composition. \(f : g\ x = f(g(x))\); we don't like unnecessary parentheses.
- \(f\) contains \(f\), means this is a recursive definition.
- \(!f\) means apply \(f\) separately to any number of inputs, that is, if you want, in parallel.
- \(f =\ ..!f..\) produces a tree structured computation
graph: each (parent) instance of \(f\) has some number of daughter
instances \(!f\).
Here \(f\) is more an organizing principle of a control pattern
than what you might normally think of as a meaningful
self-contained function. All the work in \(f\) is done by the
parts of \(f\), always outside the included \(f\)'s, in what
we call \(c, d, pre, post\) parts -- except for the tiniest trivial bit (\(f_b\)) at
the base of the division tree. The point of \(f\) is merely that
it is recursive and parallel, and otherwise it does nothing but
orchestrate the sequence, the subdivision, the mutual dependencies of
its parts.
- \(c\) combines once for each time \(d\) divides. After \(d\) and
before \(c\) we have data parallelism: each subdivision of the
data can be operated on separately.
- \(f_b\), the base function, converts a data value from input type
to output type once division is down to a single data value.
- \(pre = preop : precomm\ \ \ \) and \(\ \ \ post = postop : postcomm\ \ \ \ \ \)
Here, communication feeds local operations in both cases.
- \(preop, postop\) prepare for or recover from the recursive calls
\(!f\). Connecting \(d\) to \(f\), or \(f\) to \(c\), these
enable separated sub-arrays to be separately fed to instances of \(f\), or enable sets
of \(f\) outputs to be combined so as to have the form and
meaning of the function's output. These do local calculations only, isolated from any
communication or division or combination. Each converts data of
the input or output type from one level to another but retains the
same general data type on the input or output sides, respectively.
- \(precomm, postcomm\) do communication only, to bring in any
additional data needed so that local calculations can be local.
Working right to left in the formula, you recursively break down the
inputs: take the array and apply division until you can't
divide any more. After that, apply combination until you can't
combine any more. On this general scaffold, you need three things to
happen. At the limit of division, you need a function to convert from
input data type to output data type (it might just be a conceptual
change of units not always a numerical change of values). On the way
up or down or both, you may need a local calculation to be done at
each level to prepare the inputs for separate application of \(f\), or
to postprocess the outcomes so their combination will be a valid
function output, within each level. Those local calculations will
depend on local values, certainly, but also perhaps on values that
need to be communicated from elsewhere in this or another array.
Those dependencies determine what data is needed locally to do the
calculation, and what needs to be communicated from elsewhere.
Once you get clarity on the Local-Co-SM breakdown of your problem, it
is easy to express it in Divacon, and you now have a language
and methodology for parallel algorithm design. You will find it
enables optimal and concise results.
Okay that was your quick review. If you have a spark burning, now,
read on.
Introduction
These ideas, approach, formalization, notation, and system, come to me
from the great George Mou, whom I am lucky and proud to know as a
friend.
George's work in the math of algorithm parallelization using his
elegant and powerful Divide and Conquer formalization, Divacon,
is both revolutionary and generally unknown. For example, today's
version of the Wikipedia entry for Divide-And-Conquer parallelization
fails to even mention George's work. The Divacon approach
enables miraculously clear thinking.
Let's not be like that. Properly educated, we could build unbelievably simple, provably
correct, highly optimized parallelizations of important as well as new
algorithms. So let's get educated!
Forgive me for a bit of a rant. Because this is the exact kind of BS
that has happened repeatedly in the past in the very subject I'm going
to work on as my homework with you here, the FFT. Did you know that it was
actually Gauss who invented the Fourier Transform before Fourier, Gauss who
further invented the discrete FT, and Gauss who invented the Fast FT just as
we use it today, in 1805, all at once, while Fourier was doing the first
bit in 1807? But no one paid attention to it until Cooley and Tukey
reinvented it in 1965. Even after that it was further discovered that
not just one or two or several but many other reinventions of the same
ideas had been forgotten -- including a review of all the
forgotten reinventions, saying how terrible it was that they
were forgotten -- which itself was forgotten for 60 more
years! Here's a review of that reviewer:
"A second reference to Gauss' algorithm ... was written by
H. Burkhardt in 1904. Burkhardt comments that Gauss' method was
general, but seemingly not known by practitioners. It is interesting
to note that Goldstine's and Burkhardt's works went almost as
unnoticed as Gauss' work itself." (Heideman, Johnson, & Burrus, 1985)
It may be without irony that George tells me he thinks the world may
be ready for this in 200 years. No, it would be typical. (Am I
sensitive?)
When I see this it kills me. It is on us to capture and use
these achievements, to go through these doorways of genius. If we
forget it like Gauss's
notebooks or Danielson and Lanczos (1942) or the many others, then
it is us, we, the dummies who should regret it, and after folks
rediscover it later, we will be ashamed. Our age will be thought of
with condescension, like those Aztecs or Incas who had wheeled toys
but never built wheeled vehicles: so close and
yet so far. Worse in our case, because our new powerful ideas have
actually been built and used, but we still ignored them.
So let's actually not!
Let's push it forward, and bring it into our
understanding and our practice.
In service of this, my small contribution here: motivation,
explanation, tutorial of how to think about parallel problems, a
little more verbose redundancy to make it easier to read than George's
terse mathematics. I hope this brings George the credit he
deserves, sooner than 200 years hence.
Inspiration
George Mou's 1990 Yale Computer
Science PhD thesis was written in the Golden Age of parallel programming
which was going on at Yale in the late 1980's and early 1990's.
In that golden age
Alan
Perlis was alive, saying things like
- "algorithms often fail to exhibit a clarity of structure which
they really possess."
- "While I have spoken glibly about variability attached to the
identifiers of the language, I have said nothing about the
variability of control. We do not really have a way of
describing control, so we cannot declare its regimes. We should
expect our successor to have the kinds of control that ALGOL has
-- and more. Parallel operation is one kind of control about
which much study is being done."(p 8)
- "A natural technique to employ in the use of definitions is to
start with a language X, consider the definitions as enlarging
the syntax to that of a language X' and give the evaluation rules
as a reduction process which reduces any text in X' to an
equivalent one in X."(p 9)
- "Programmers should never be satisfied with languages which
permit them to program everything, but to program nothing of
interest easily."(p9)
Also remember how Backus
admonished us about the limits of von Neumann architecture
(1-word-at-a-time communication, implied especially by the assignment
operator), "Conventional programming languages .. Inherent defects
.. inability to effectively use powerful combining forms for building
new programs from existing ones, and their lack of useful mathematical
properties for reasoning about programs. .. [versus] an algebra of
programs whose variables range over programs and whose operations are
combining forms. ... [and for example]
A Functional Program for Inner Product
Def Innerproduct = (Insert +) o (ApplyToAll *) o Transpose
[where] Composition(o), Insert, and ApplyToAll (a) are functional
forms that combine existing functions to form new ones...
"(ACM 1977 Turing Lecture)
And we can't forget the support of a highly distinguished dissertation
adviser and committee, Paul Hudak, David Keyes, and Bill Gropp.
(Just google those names to track the progress of parallel processing over
3 decades.)
Drum roll, please...
Alan Perlis said to George Mou, "Divacon is the golden key to the future of computer science".
Evidently Divacon parallelization is powerful. Its elements are
simple, and its way of thinking is simple. Let's empower ourselves by learning and
practicing it.
A Language For Parallelism
This conceptual approach, language, notation are from George Mou's
1990 Yale Computer Science PhD thesis.
We will try to get grounded in George's "DC" or
Divide-and-Conquer or "Divacon" ideas and formalism, which give a
mathematical form and elegance to parallel decomposition of
algorithms.
Incidentally, Divacon notation has been made executable several
times, not just with the Connection Machine of the 1980's and 90's, so
let us recognize it is a grounded reality, not just a powerful tool of
hyperabstract thought. After this DC language training, I'll apply
this approach to the Fourier transform, as an exercise.
Before we begin, let me give ourselves all an attitude adjustment.
What we are embarking on here is the understanding of how to break
down and solve big problems, like processing an array of \(10^{10}\)
elements. So all the definitions and concepts
and operations we are going to go over, you should think of them as
being not just simple and definite and understandable, but also giving
you the power and the leverage and the understanding at scale, for
problems at any scale. So yes, please, first accept these ideas
simply into your mind; they are stated simply enough. But also
recognize and feel their generality, how they might apply to big
things as well as small. If you do recognize that, then you may feel
the power that is thrumming under your fingertips as you bring
definitions, forms, and components into your mental models and the
working spaces of your mind. Their simplicity means that you can put a
big harness over a big problem and control it fully by simple
reasoning. Okay? Okay.
To begin, we will be needing some math-like notation and terms.
The conventions below are taken from George's thesis.
- (): Skip parentheses in function application. Write f(x) more cleanly as f x.
- (..) means construct an object whether tuple or vector or recursive array or etc. from the ..-enumerated elements.
- A tuple is a grouping data structure comprising subitems which may not be of identical structure.
- A vector is a tuple constrained to comprise subitems of
identical structure, and to be indexed by a (binary) integer, or
by an index set which pulls out any number of its elements.
- A recursive array is a smart vector: one that knows its length and can be divided (until atom? returns True) and combined.
- Normalized indexes count from 0 to N-1. This lets you think clearly by not having to think about indexes, a great source of noise, confusion, and spaghetti code.
- Relative indexes are normalized by some 1-to-1 mapping then inverse-normalized by the inverse mapping.
So a (sub)vector might begin its life unnormalized, then become normalized, then be
operated upon while normalized, which makes the algorithm simple, then
be inverse-normalized back to its original indexing. Separating the
calculations from the indexing struggles makes things nice and simple.
George likes to distinguish between communications versus operations.
Getting the data you need, and doing something with it, should be
separate concerns. Indeed, each has nothing intrinsically to do with
the other. So we do relative indexing so as to forget about where the
data came from and what indexes it may have had in that location, so
that our operations can be simple and pure and local.
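If it helps to see the indexing idea as code, here is a tiny sketch of my own (plain Python lists, and only the simple contiguous case; an even-odd division would use a different 1-to-1 mapping, such as \(i \mapsto 2i + parity\)):

```python
def normalize(parent_indexes):
    """Map a subvector's parent-array indexes to 0..N-1, remembering the offset."""
    offset = parent_indexes[0]
    return [i - offset for i in parent_indexes], offset

def denormalize(local_indexes, offset):
    """Inverse mapping: recover the original parent-array indexes."""
    return [i + offset for i in local_indexes]

local, offset = normalize([4, 5, 6, 7])            # a subvector occupying parent slots 4..7
assert local == [0, 1, 2, 3]                       # normalized: think only in 0..N-1
assert denormalize(local, offset) == [4, 5, 6, 7]  # inverse-normalize when combining back
```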
Parallel Forms
Next, some combining forms:
- :: function composition. f : g means first apply g, then apply f to the output of g.
- !: distributive application. \(!\ f\ (x_0,..x_{n-1})\ =\
(f\ x_0,..\ f\ x_{n-1})\) or rarely \(!\ (f_0,\ ...\ f_{n-1})\
(x_0,..x_{n-1})\ =\ (f_0\ x_0,..\ f_{n-1}\ x_{n-1})\) if there
are different f's.
\(!\ f\ X\) is the essence of parallel decomposition.
The symbol \(!\) here says apply f to each of the separate
elements of X. It doesn't need them to be done in any particular
order, and in particular they can be done separately and all at
once, that is to say by parallel processors, if you have them
available to do the work, or not, if not. (We are doing math
here; implementation is a separate concern and typically there is
a forest of implementation possibilities which initially would
only distract our minds from the simple essences. So do this as
math.)
If you can take apart your algorithm so that it contains itself
not just recursively but inside a distribution, as in \(f\ =\
...\ !\ f\ ...\) then you have a parallel program.
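If you want to see these two forms as executable code, here is a minimal Python rendering; the names `compose` and `distribute` are mine, standing in for \(:\) and \(!\):

```python
from functools import reduce

def compose(*fs):
    """f : g : h  --  apply right to left, as in f(g(h(x)))."""
    return lambda x: reduce(lambda acc, fn: fn(acc), reversed(fs), x)

def distribute(f):
    """! f  --  apply f independently to every element; no ordering is implied,
    so a parallel implementation is free to run all the applications at once."""
    return lambda xs: [f(x) for x in xs]

square = lambda x: x * x
assert distribute(square)([1, 2, 3, 4]) == [1, 4, 9, 16]   # ! square

inc, double = (lambda x: x + 1), (lambda x: 2 * x)
assert compose(inc, double)(5) == 11                       # (inc : double) 5 = inc(double(5))
```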
And some (standard abstract) math won't hurt us:
- A relation (an operator), written below as \(r\) or \(R\), for example \(+\) or \(*\), takes some input(s) and yields some output(s).
- A morphism is a function which can be done in either of two domains. So
f is a morphism if \(f ( x_1 \) r \( x_2 ) = (f x_1) \) R \( (f x_2) \).
That is, operate on your inputs in the r domain, then apply f to the output, or do f on the inputs, and operate in the R domain;
either way you get the same thing.
Here's one. Do you have your logarithm concepts and rules in
memory? If not, go look them up again, they are cool. If I say this:
\(\ log\ :\ *\ =\ +\ :\ !\ log\)
Then do you recognize the ":" of function composition? Do I need to
write the vector of inputs (2 will do) to which a function
applies, as follows?
log(product(\(x_1, x_2\))) = sum(log(\(x_1\)), log(\(x_2\)))?
I'd like to remove the messy
and confusing parentheses: log : prod X = sum : ! log X,
and let the function apply to all the members of the vector.
So that's a bit of notation.
But the meaning is there too. Remember from calculus that the
log of a product is the sum of the logs. Right? For example,
\(log(a*b*c)=log(a)+log(b)+log(c)\). Morphisms are helpful if it
is easier to go do something in a different domain and then come
back to the same result.
This is getting close to \(f\ A\ =\ c\ :\ !\ f\ :\ d\ A\) which
would say that \(f\) on arrays is (can safely be defined as) a
morphism of itself on divided sub-arrays that are then combined.
That would be a mathematical proof for a parallel decomposition
of a function.
Finally the simplest of all things, \(id\):
- id means the identity function: its output is identical to its input.
Was it simple enough? Maybe you're stuck on morphisms, but with ! and
:, can you see how \(f\ =\ c\ :\ !\ f\ :\ d\) would be a parallel decomposition?
That's all you really need.
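As a toy rendering of that idea in Python (my example): log of a product computed as a genuine \(c\ :\ !f\ :\ d\), where \(d\) splits the array in half, \(!f\) recurses on the halves independently, and \(c\) is just \(+\):

```python
import math

def log_of_product(xs):
    """log(prod xs) computed as  c : !f : d,  with c = sum, d = split-in-half."""
    if len(xs) == 1:                                  # base case: atom
        return math.log(xs[0])                        # base function: cross into the log domain
    halves = [xs[:len(xs)//2], xs[len(xs)//2:]]       # d: divide (left-right)
    partials = [log_of_product(h) for h in halves]    # !f: independent recursive calls
    return sum(partials)                              # c: combine in the log domain

data = [2.0, 4.0, 8.0, 16.0]
assert abs(log_of_product(data) - math.log(2 * 4 * 8 * 16)) < 1e-12
```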
Decomposition: How to Think Clearly About Computing
George's favorite lesson: Don't mix up how you will do something with
what you are going to do, Tom!
But his primary teaching point in PDC is
this, that computer algorithms are built from three orthogonal
components: operation or calculation, communication, and structure
manipulation. These are separate; they should be kept separate; and
since they are orthogonal they can be specified separately.
Further, mixing them produces confusion. Further, if you merely
specify these three clearly, you've already done the job of parallel
programming. More than just describing some qualities of the program,
you will have already written it. Let me say it a few times.
Understanding computing deeply, which means simply, means recognizing
the orthogonality of communication, structure manipulation, and local
operations. When you communicate data, you aren't manipulating
structure, and you're not doing local calculations. When you create
and modify structures, you aren't communicating, and you aren't
calculating. And when you are doing local calculations, you aren't
communicating or modifying structures. If you can be clear about
this, things fall into place very easily. Sadly, if not, not -- not
in the world of parallel algorithm design.
On the other hand, the mere decomposition actually specifies, actually
writes, a Divacon algorithm and a Divacon program. So let's try to be
extremely clear about what these orthogonal dimensions are, and let's
try to develop a fastidious intuition when we are specifying one of
them about not mixing it up with either of the others. If we can do
that, it will turn out that we can eat elephants for breakfast.
- At the bottom of the analysis tree, and in various
algorithms on the way up or down or both also, one actually
does some calculations with the data that has been placed on
your plate to make new values. That's operation.
- Before the operation, the patient must be on the table.
Deciding what parts to get, that's communication.
- During a war, the soldiers must be arranged into groups of different sizes
by dividing up or combining them together. That is structure
manipulation.
George seems to say, if you will just get your mind clear on what is being done in these
three orthogonal dimensions, your work is basically done for you.
Let's go through them twice more, and see if we can be clear minded
about them then.
- Local operations a.k.a. "universal functions" (p 53)
"Computations over arrays also need operations of an
orthogonal nature, that is, operations that map array entries
to new values."
- Communication can be understood as having two
phases. The second phase is to go get that needed data and
put a copy of it here, in or adjacent to the thing that
needs it. But getting what's needed depends on the key
first phase which picks out the indexes of what will be
needed in the local operations.
In the expression \(\#\ !\ f\) this will be the \(f\) (for
"find", not for "function", and also it's not "get", \(f\)
just says where what you want is to be found). Then \(\#\)
says to actually get it and make a 2-tuple of it together with the
element that asked for it. And "!" says do the same
(opposite) communication on both sides. See below for
details and examples.
- Structure manipulation (a.k.a. relational functions,
p53) exists in a couple of forms: tuple creation as part of
communication, and the divide and combine methods.
First, given a communication generator such as \(f\)
(we'll be using corr, mirr, br, mainly), the
functional forms "\(\#\ f\ x\)" (and "\(\#\ !\ f\ X\)")
construct a new 2-tuple for each element \(x\) (or element of
\(X\)), to contain the elements picked out by \(f\).
Like, if your system wants to add this and that, it'll apply
\(+\) to the tuple (this,that), which was created by \(\#\
f\ index\_of\_this\). Like this:
\(\ +\ :\ (\#\ !\ f)\ \vec{i}_{these}\)
Are you following? Summation applies to the pairs whose
first element is one of \(X_{i}\) and whose second
element is whatever \(f\) aims you at, when presented with one of
\(\vec{i}_{these}\), such as the corresponding element in another
sub-array (if f = corr), or the opposite/mirroring element
(mirr), or all the elements (br) there.
Second, the divide and combine methods \(c : d = d^{-1} : d
= id \) take arrays and return divided or combined arrays,
without modifying their contents. Thus no local operations
are part of \(d\) or \(c\), since they don't change any
array values. They do change the indexing schemes
for the divided or combined arrays, so as to count in order
from \(0\) to \(N-1\) where the length of the output array
is \(N\).
It is preferred (for reasoning to be made either simpler, or possible) for \(d\) and
\(c\) to be independent of the data. The typical ones are
\(d_{eo}, d_{lr}\) with \(d_{eo}^{-1}=c_{eo}\), and
\(d_{lr}^{-1}=c_{lr}\) for Even-Odd and Left-Right divide and combine,
but you may also encounter
\(d_{ht}\) with \(d_{ht}^{-1}=c_{ht}\) being unbalanced head-tail
division and combination.
It's a lot. But I've been through it a few times and it is starting
to make sense, so I encourage you.
Now, once the three orthogonal dimensions of computing are clarified,
writing a Divacon program basically reduces to checking 6 boxes on a
checklist. All you have to do is define the divide and combine
functions which manipulate (divide and combine) structures, the
communication function which specifies what other data from where will
be needed to do a computation, and the local computation which can
finally be carried out when the data is present. Actually these are
easy to separately define, and systematically used in the Divacon
structure, which makes the work so easy and quick that George tells
me:
Tom, if it takes longer than one minute, then your answer is
wrong.   --George Mou
Here's the checklist. Divacon defines a higher-order-function
generator, PDC(), which accepts six functions, and returns a parallel
program implementing the algorithm embodied in those functions.
Define and apply a PDC as follows:
\(pdc\ =\ PDC(\)divide, combine, preadjust, postadjust, basepred, basefunc\()\)
\(Y\ =\ pdc\ X\)
A PDC so defined is equivalent to the following
expression, which constitutes a general if not exhaustive theory
of parallel algorithm decomposition.
Divacon:
$$\begin{align}
pdc\ =\ (\ & basepred?\ \ basefunc \\
& else\ \ combine\ :\ postadjust\ :\ !\ pdc\ :\ preadjust\ :\ divide\ )
\end{align}$$
I find myself repeatedly studying this, and it is getting more intuitive.
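Here is one way the PDC generator might be rendered as a Python higher-order function. This is my own sketch of the control scaffold, not George's implementation; for simplicity I let preadjust and postadjust take and return the whole list of sub-arrays or sub-results:

```python
def PDC(divide, combine, preadjust, postadjust, basepred, basefunc):
    """pdc A = basefunc A                                            if basepred A
             = (combine : postadjust : !pdc : preadjust : divide) A  otherwise"""
    def pdc(A):
        if basepred(A):
            return basefunc(A)
        subarrays = preadjust(divide(A))            # d, then pre
        results = [pdc(sub) for sub in subarrays]   # !pdc -- each call is independent
        return combine(postadjust(results))         # post, then c
    return pdc

# The most trivial instantiation: summing an array.
identity = lambda x: x
atom = lambda A: len(A) == 1                        # base predicate
d_lr = lambda A: [A[:len(A)//2], A[len(A)//2:]]     # left-right divide
c_plus = lambda parts: [parts[0][0] + parts[1][0]]  # combine two 1-element results

array_sum = PDC(d_lr, c_plus, identity, identity, atom, identity)
assert array_sum([3, 1, 4, 1, 5, 9, 2, 6]) == [31]
```

Every more interesting algorithm differs from this trivial sum only in the six arguments handed to PDC.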
The adjust functions are in general three phases:
1) communication specification,
followed by 2) communication, followed by 3) local operation. They
preserve the data domain they are operating within, so for
example an FFT on the preadjust and divide side is all
rearranging sampled data real numbers representing the time
domain signal, while the postadjust and combine side is all
operating on frequencies which are complex numbers representing
amplitude and phase for each different frequency.
Also, look suspiciously at an adjust function which does not emit
the same kind of data structures it consumes. It is not a
structure manipulation function!
To create a Divacon algorithm the system creator specifies 6
things (most of which are typically either predefined since
generally useful, or quick work once you think in these terms and
with this kind of vocabulary). The creator then lists the six as
arguments to the PDC higher-order-function (a function that
creates a function), and runs the returned function on some data
X to yield some output Y.
It may help to draw out a computation graph such as on
p79,
and track what happens where so you can see what everything
means. You should see how recursive definition works, how divide and
combine operate on the tree, where basepred applies and what happens then,
and finally how the adjustment functions work and where they apply exactly.
Then if you can figure out communication, you'll be ahead of me.
Now, let's review the 6 with reference to the 3 orthogonal dimensions
of computation.
But first, basepred. This is used to detect the leaves of the
computation graph, at which divide has done its job and no
longer applies. basepred does none of the three
orthogonal tasks of computing, neither creating new values, nor
figuring out what to fetch, nor modifying structure. It triggers
basefunc, is all; and then it would be basefunc
that does a local operation.
Structure modification is carried out within divide and combine.
Only balanced divide and combine operations yield benefits of
parallel programming; \(d_{ht}\) etc. will be as slow as any
sequential program, stepping one by one through the data array or list.
In contrast, effective parallel decomposition gains its power by
splitting giant things into smaller and repeatedly smaller
things, which can all be handled at the same time in parallel.
Then you might still have a million multiply-adds, for example,
but they might take a single time step, given a million
processors. Instead of eating an elephant one bite at a time
(sequentially), a million, little, themselves-edible piranhas
each simultaneously take their separate bites all at once, and
eat the whole thing all at once, except that then their bigger
and bigger cannibal brothers eat two then the two that ate two,
then ... finally, the whole digested elephant inside two giant
beta piranhas, which are eaten themselves in a single monstrous
bite by the alpha piranha of them all. This is the miracle of
parallel processing. (I know, elephants don't live in the Amazon!
It's a metaphor, silly.)
So to get back to divide and combine, all that those should do is
take an array and split it into N sub-arrays, or vice versa; they
do no local operations; also unlike algorithms like
quicksort they don't care about the contents of the
arrays, which makes it much easier, again, to reason about what
they do. If it's nice and clean like that, then you can indeed
say and be satisfied that \(c : d\ =\ id\). Then you can divide
up, and eat up, the whole elephant, in nice tiny amenable tasty
cubes at the lowest level, but in fully confident and no more
difficult or complicated, enormous bites at the top. Because
it's the same divide and combine at every level, and they always
match perfectly with the other. Because \(c : d\ =\ id\).
A combine task, for example, might include the denormalizing of
the second of two vectors upon combining them into one. In
this case, it adds subvector length N to all the indexes of the
vector that is being denormalized, in order to sequentially
combine it with another. Thus vectors indexed with [0 1]- and
[0 1]- could be combined to a [0 1 2 3]-indexed vector.
(Isn't that a case of creating new values, which is a local
operation? Or do we make an exception for the mechanics of this
"relative indexing" system, and just hide those index
adjustments?)
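Mechanically the index adjustment is tiny either way; a hypothetical sketch of that denormalize-and-append step:

```python
def denormalize_append(left_idx, right_idx):
    """Sequentially combine two normalized index sets: shift the second by len(left)."""
    return left_idx + [i + len(left_idx) for i in right_idx]

# Two [0 1]-indexed subvectors combine into one [0 1 2 3]-indexed vector.
assert denormalize_append([0, 1], [0, 1]) == [0, 1, 2, 3]
```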
Local operations can be carried out within pre-adjust, post-adjust, and basefunc.
I once thought I wanted to put a multiply-add inside the combine
operation: Mistake! That confuses matters; so instead the better
approach is to put it into a postadjust function, or a preadjust
in the next layer up. basefunc is often simply id,
and all the actual work is done on the way up or down.
Communication is specified by corr, mirr, br,
etc. index generators, or I might think of them as "finders",
\(f\), and then used in structure modification in the parallel form
\(\#\ !\ f\), which may be part of the adjustment or base functions.
In parallel divide and conquer we need a base predicate to tell
us True when we are finally done dividing the problem into
sub-sub-problems operating at the base level of the division
tree, the leaf level, and a base function to do the operation at
that base level.
"[T]he simplest and most intuitive (base predicate and base
function) are atom? and id respectively."(4.2.3,
p77)
Then the simplest leaf operation would be: if (atom?) then return id.
This is what is specified by PDC(..., atom?, id).
(As a bit of a miraculous example in the case of the FFT,
basefunc=id because the 0th angle step is \(\theta=0\) and
\(X * e^{i0} = X * 1 = X\).)
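In code, those two base pieces are about as small as functions get; a sketch consistent with the PDC skeleton above, plus a check of the FFT remark:

```python
import cmath

atom = lambda A: len(A) == 1      # base predicate: a single element remains
identity = lambda x: x            # base function: hand the atom back unchanged

assert atom([7]) and not atom([7, 8])

# The FFT remark: at the base the twiddle factor is e^{i*0} = 1,
# so multiplying by it really is the identity.
assert cmath.exp(1j * 0) == 1
```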
On Communication
# almost means Communication, a concept I struggled
with. The following notes are copied or summarized from section 3.6.3
p56ff.
Let \(f\) be a finder or communication generator mapping index
sets to index sets (not always one-to-one; for a single index it
might want to bring in more than one).
Finder \(f\) will know enough about where it is to figure out
where other needed stuff also is located, typically the low bits
of the current index, which it can use to refer over to a
corresponding element, for example, in a different section of the
same array, or a corresponding section of an adjacent array but
which shares the same low bits in its index. In that case it
doesn't even have to ask anyone where to look for something, it
already knows because it knows where it is, itself.
So given an indexable array \(A\) we have \(A : \vec{i}\) which is the
indexed subset of \(A\) and we also can have \(A : f\ \vec{i}\) which is the
subset of A that the mapped-to indexes pull out, when the index set \(\vec{i}\)
gets mapped to some other index set \(f\ \vec{i}\). So that's another subset of A.
Then we can define \(\#\ f\) as:
\(\#\ f\ A\ =\ A'\), where \( A'\ \vec{i}\ =\ ( A\ \vec{i},\ A\ :\ f\ \vec{i}) \)
that is, get an extra copy of the \(\vec{i}\)-indexed values of
A, namely \(A\ \vec{i}\), bring them in to a new tuple as its
first element, but then also bring in the \(f\)-modified-index
values from A, put them together into a 2- or N-tuple, and now you have
the other data that you may need later all in one place for the
operation you'll be doing later. In the context of such pairs,
it is convenient to define:
- \(self\ (x,y) = x\)
- \(other\ (x,y) = y\)
George says, "The operator # thus is a higher order function that
transforms a function over index sets to a function over arrays
that performs communication. " (This becomes more clear when you
recognize that corr operates (typically) on a pair of data
arrays, and specifies for each element that the corresponding
element in the other array is just the one in the same position.)
George continues: "A communication function maps each array entry
to a pair, where the first component is the entry's original
value, and the second is the value of another entry whose index
is determined by applying the generator to the entry's own
index." (So corr, mirr etc. operate on pairs of arrays by
working on each of their entries, positionally, and copying the
other array's position-matching element to come in here with this
element in a tuple, which provides what is needed for local
computation. Note that this is not considered to be structure manipulation,
and certainly not local computation: it is communication.)
You'll see \(\#\ !\ f\) a lot, which also uses the distribution symbol \(!\) to say,
apply \(f\) separately to each of the elements of the following array.
So there are several standard communication generators, and you
can read the
thesis
(section 3.6.3) for several more. So \(f\) is commonly:
| Generator | Meaning |
| --- | --- |
| \(corr\ \vec{i} = \vec{i}\) | correspondent communication (the index of the corresponding element in an adjacent array will be the same, or have the same low bits if the two arrays are merely conceptual and actually occur within a larger including array). |
| \(mirr\ \vec{i} = $ - \vec{i}\) | mirror-image communication ($ is N-1 for array size N). |
| \(br\ \vec{c}\ \vec{i} = \vec{c}\) | broadcast from one entry to all. |
| \(last\ \vec{c}\ \vec{i} = $ - \vec{c}\) | broadcast from an entry with backwards index. |
| \(null = nil = id\) | use this in a tuple of generators to capture the corresponding tuple of indexes. |
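Here is a sketch of \(\#\) and a few generators in Python. This is my own single-array rendering (the names `comm`, `corr`, `mirr`, `br` mirror the notation; in the thesis the generators map index sets to index sets, and corr typically reaches across a pair of sub-arrays, which I simulate by treating the two halves of one list as the pair):

```python
def corr(n):
    """Correspondent: within a 2n-long array seen as two halves, index i points
    at the element with the same offset in the other half."""
    return lambda i: (i + n) % (2 * n)

def mirr(n):
    """Mirror: index i points at $ - i, where $ = N-1."""
    return lambda i: (n - 1) - i

def br(c):
    """Broadcast: every index points at the single entry c."""
    return lambda i: c

def comm(finder):
    """The # operator: turn an index-to-index generator into an array-to-array
    communication, pairing each entry with the entry its finder points at."""
    return lambda A: [(A[i], A[finder(i)]) for i in range(len(A))]

A = [10, 20, 30, 40]
assert comm(mirr(4))(A) == [(10, 40), (20, 30), (30, 20), (40, 10)]
assert comm(corr(2))(A) == [(10, 30), (20, 40), (30, 10), (40, 20)]
assert comm(br(0))(A)   == [(10, 10), (20, 10), (30, 10), (40, 10)]
```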
If it's worth your doing a bit of homework, work through the examples
on p59 so you understand them.
If you're still struggling maybe my additional notes on Communication in Mou's thesis will help.
The six functions are mostly trivial or one selection from a small set of canonical
functions. Their combination is to be understood as follows.
The ":" composition operator operates right to left: \((f:g)(x) = f(g(x))\).
So the first function in the above chain is the divide function.
Having divided a bunch of stuff typically in half, we then apply to each half
the next function in the chain: preadjust. Preadjust is going to be a "communication function"
which specifies, and then brings in, what other data is needed to do this computation.
The local computation which does the work will be needing all its
relevant data to be local, so that means we need to capture
everything it needs and bring it in. In particular there will be
a tuple containing everything the local operation requires.
Further, there will be an array of such tuples, because to do a
parallel process means to identically do the same in a bunch of
parallel processors, and to do that we need to provide the same
kinds of data, whatever the process needs, namely some tuple of
all the required data elements, to each of all the parallel
processors, which means we need an array of those tuples.
Then we can send one of the tuples to each of the processors, and
it will be able to do its work separately from all the other
processors, which means it can be done in parallel, which is the point.
Whereas if you need to separately think about what you need in
the midst of the working of each parallel process, and then copy
over what it requires from wherever it is, well now you're
talking spaghetti code, and we don't tolerate such poor hygiene
here.
Although this was also discussed here,
the following may be helpful.
divide, and combine are structure
manipulation functions: they take an array, the input data,
and restructure it. \(d_{eo}\) divides a single input array into
two, taking evens to the left and odds to the right; while
\(c_{eo}\) combines two input arrays into one, shuffling them
into even and odd positions in the output. This has nothing to
do with going elsewhere to get any data (communication), or doing
anything to the data (local calculations): it is orthogonal. But
once you've specified your divide, and combine
functions, you have a whole structure up and down the parallel
computation graph which makes life simpler for the other tasks.
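Concretely, a sketch of the standard divides and combines on plain Python lists (the real recursive-array versions also manage the index normalization, which lists hide):

```python
def d_lr(A):
    """Left-right divide: split into two contiguous halves."""
    n = len(A) // 2
    return [A[:n], A[n:]]

def c_lr(parts):
    """Left-right combine: inverse of d_lr."""
    return parts[0] + parts[1]

def d_eo(A):
    """Even-odd divide: evens to the left, odds to the right."""
    return [A[0::2], A[1::2]]

def c_eo(parts):
    """Even-odd combine: shuffle the halves back into even and odd positions."""
    evens, odds = parts
    out = [None] * (len(evens) + len(odds))
    out[0::2], out[1::2] = evens, odds
    return out

A = [7, 1, 8, 2, 9, 3, 10, 4]
assert c_lr(d_lr(A)) == A                      # c : d = id
assert c_eo(d_eo(A)) == A                      # c : d = id
assert d_eo(A) == [[7, 8, 9, 10], [1, 2, 3, 4]]
```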
Similarly communication and local calculation are orthogonally
specifiable, and better so.
Communication resides in two forms: specifying what to get, and
getting it. But the more basic idea is to think about this in
the context of parallel programming. In a parallel program,
there is going to be some data which every processor will be
separately processing, because to be parallel we want them to be
working independently. This means we would like that data to be
entirely local to each process, so that it can just do its local
calculations without interrupting itself to go fetch data from
some other place. This means all the data it needs must be
provided in a data structure fed to it, separately. This means
we want an array of tuples, one for each subprocess, each to
contain all the data required by any instance of the
subprocess. Then each one can proceed on its own, which is
what makes it a parallel program.
Secondly, consider the interrelationships among the data of a
parallel process. Is it not always and necessarily a bunch of
data? If it's a bunch of data can we not always go through it
and make an array of it all? Of course, yes. Having made an
array of it, and by assumption it's all the data our parallel
process will be using, isn't there some potential dependency
between the different elements of the data? You may need,
for processing this part of the array, some data from some corresponding
other part of the array? That's the idea here.
Let's put it all together:
pdc = combine : postadjust : ! pdc : preadjust : divide
Reading backwards (that is, in compositional order from right to
left), the first, divide function takes the input which is an
array of data, and divides it typically in half (but dividing in 1/3 or 1/N's is
okay too).
Now it is tuple-building time, so that we can run the process
separately on its collected required data elements, stuffed into
one tuple for each process. The preadjust function builds the
tuple, first by specifying what is the one corresponding element
(or more) which needs to be brought in here for the local
computation, that is, by an index generating function. Second, by copying the index'ed other data
from the other sub-array in to this element's tuple, by a
communication function.
So the index generating function takes the normalized index
within our subarray (the index of "this" tuple), and somehow
calculates the index or indexes of the desired other data, in a
somehow-corresponding place in the other. For example, the
\(i\)th elements in both subarrays both have
index \(i\), so it's not too complicated to figure out from one what the other is.
(That's called corr, for "Correspondent communication".)
Another, mirr for "Mirror", pairs up the \(i\)th element in this
sub-array with the (length\(-1-i\))th in the other, and vice
versa; in both cases the "other" is known for any array's \(i\)th
element without asking anywhere else for outside data.
Then the communication function builds our tuple. It takes the
local array element as the first element in the tuple, and it
fetches from the other array the indexed element whose index was
generated by the index generating function. So both right and
left sub-arrays get manipulated into new structures, now they
become arrays of tuples.
Does it seem as if we could wait? No! We want all the
communication to be handled at once, and separately from the
local calculations. We create nice tidy separate workspaces with
everything needed in a nice tuple of operating supplies, so then
the local operation task can be clean and simple.
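For instance (a toy example of mine, not from the thesis), a preadjust whose local operation just needs each element paired with its correspondent across the two halves, and which then sums each pair, composes exactly as described:

```python
corr2 = lambda i: (i + 2) % 4                  # correspondent finder for two halves of length 2
pair_up = lambda A: [(A[i], A[corr2(i)]) for i in range(len(A))]   # the "# ! corr" step
add_pair = lambda t: t[0] + t[1]               # local operation: one tuple in, one value out
preadjust = lambda A: [add_pair(t) for t in pair_up(A)]            # ! add_pair : # ! corr

assert preadjust([1, 2, 3, 4]) == [4, 6, 4, 6]
```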
At this point we have specified divide, combine,
and communicate. Divide and combine are
functions given separately to PDC(), while communicate
might be just a compositional part of the preadjust function (the
first-applied part), if it is followed by a local computation
function to be carried out in the building up of the computation
graph.
The postadjust function might also include communication
as well as local operation.
See, parallel processing builds up a computation graph which is a
tree. The tree at the top is a single node that includes the
whole giant problem as one bite, but at the leaves of the tree the
problem is usually too trivial to even do anything. You can
often swallow without even chewing once!
You might think (I confuse myself by thinking) the tree does
calculations at each node thereby doing a recursive breakdown or
buildup of results from smaller parts of the tree, but actually
those calculations are not so much done at the node, as
during the building out of the tree, or during the
summarizing or winding back up of the tree. The calculations
(except at the very ends, the leaves of the tree) are always and
only going to be done on the way up or on the way down. Once
again, there will be no separate traversal of the tree to do
operations at each node outside the preadjust or postadjust
processes which themselves build out the branches from the
nodes, or collapse the branches into the nodes, respectively.
Finally, how do we go about understanding and defining our
local operation? Let's take those two separately.
Understand the local operation by recognizing
- that it will be a function which takes a tuple (or you
might say list of arguments) as inputs;
- that it cannot access anything outside itself and its inputs,
so you could put constants in it for your math to work out,
or if you need something else then you have to build that into
your original or communication-expanded data before you even get to
the local operation;
- that it will be distributively (separately, simultaneously,
on separate processors) applied to all the elements in an
array (of tuples), one process and one copy of the function
operating on one tuple;
- that its output will be a single element, perhaps a
1-tuple, with the type and structure of an element in the
original (top-level-input) data array on the way from root
to leaf, or with the type and structure of the ultimate
answer in the final (top-level-output) data;
- that all known local operation functions are goddamn simple.
If it takes you more than a minute to write one, you can be sure you're wrong.
As George says, things are either obvious or impossible. So keep
thinking until you see the simple essence of how the function is built out of itself.
That should help you.
Next, defining your local operation. This will be the nub of the
idea of the parallel decomposition. If you stare at your
equation, you'll find a repeating pattern which allows the
problem to be broken down into multiple smaller instances of the
same problem. The way those instances' results are combined to
make the larger solution is the work of the local operation.
Actually, you really don't have to solve the problem, at all,
just see that it contains a simpler or smaller version of
itself.
In FFT it's shift (multiply by a constant) the sum of the odds
and add to the sum of the evens. In Reverse it's delete self
accept other (referring to the 2 elements of a communicated
2-tuple).
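To make the Reverse case concrete, here is a self-contained sketch of mine (restating the PDC skeleton from earlier). Under correspondent communication, "delete self, accept other" means each half adopts the other half's contents before recursing, which is all the preadjust does here:

```python
def PDC(divide, combine, preadjust, postadjust, basepred, basefunc):
    def pdc(A):
        if basepred(A):
            return basefunc(A)
        return combine(postadjust([pdc(s) for s in preadjust(divide(A))]))
    return pdc

identity = lambda x: x
atom = lambda A: len(A) == 1
d_lr = lambda A: [A[:len(A)//2], A[len(A)//2:]]
c_lr = lambda parts: parts[0] + parts[1]

def accept_other(parts):
    """preadjust = ! other : # ! corr -- pair each element with its correspondent
    in the other half, then keep only the other; at half granularity that is a swap."""
    left, right = parts
    return [list(right), list(left)]

reverse = PDC(d_lr, c_lr, accept_other, identity, atom, identity)
assert reverse([1, 2, 3, 4, 5, 6, 7, 8]) == [8, 7, 6, 5, 4, 3, 2, 1]
```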
I forgot to mention, anything that is needed that is NOT part of
the (variable-content) array can be specified as a constant in
the local operation function itself. Everything else must come
from the input array.
basepred and basefunc are discussed here.
So that's everything. Structure manipulation in the divide and combine functions;
orthogonal to communication, which is the first-applying part of the preadjust or postadjust functions;
orthogonal to local operation, which is the second-applying part of the preadjust or postadjust functions.
Bibliography
John
Backus 1977 Turing Lecture.
H. Burkhardt, "Trigonometrische Interpolation," Chapter 9 in Encyklopädie der Mathematischen Wissenschaften, vol. 2, part 1, 1st half, 1899-1916.
H. Burkhardt, "Trigonometrische Reihen und Integrale (bis etwa 1850)," Chapter 12 in Encyklopädie der Mathematischen Wissenschaften, vol. 2, part 1, 2nd half, 1904-1916.
J. W. Cooley & J. W. Tukey, "An Algorithm for the Machine Calculation of Complex Fourier Series," Math. Comput., vol. 19, no. 2, pp. 297-301, Apr. 1965. Reprinted in Digital Signal Processing, ed. L. R. Rabiner & C. M. Rader, pp. 223-227, New York: IEEE Press, 1972.
C. F. Gauss, "Nachlass: Theoria Interpolationis Methodo Nova Tractata," in Carl Friedrich Gauss Werke, Band 3, Königlichen Gesellschaft der Wissenschaften: Göttingen, pp. 265-330, 1866.
Michael T. Heideman, Don H. Johnson, C. Sidney Burrus, "Gauss and the History of the Fast Fourier Transform," 1985. Rice U Eng. Dept.
Orian Leitersdorf, Yahav Boneh, Gonen Gazit, Ronny Ronen, Shahar
Kvatinsky, FourierPIM: High-Throughput In-Memory Fast Fourier
Transform and Polynomial Multiplication Viterbi Faculty of Electrical
and Computer Engineering, Technion – Israel Institute of Technology,
Haifa, 3200003, Israel. ArXiv 2304.02336v1 [cs.AR]
5 Apr 2023.
Alan Perlis, The Synthesis of Algorithmic Systems, The 1st Turing
Lecture. (ACM V 14 No 1, 1967).
Zhijing George Mou, A Formal Model for Divide-and-Conquer and Its
Parallel Realization, May 1990, Research Report YALEU/DCS/RR-795.
Z. George Mou, and Xiaojing Wang, Optimal Mappings of m Dimensional
FFT Communication to k Dimensional Mesh for Arbitrary m and k. pp
104-119. in PARLE '93, Parallel Architectures and Languages Europe,
Lecture Notes in Computer Science #694, Springer-Verlag. 5th
International PARLE Conference, Munich, Germany, June 1993
Proceedings.