See how far you can push the idea of a string of bits, which powers the Black White Array, the Dimensional Shuffle Transform patents, and the (Communication-)Optimal Mapping paper. Or, similarly, push Local-Co-SM (there's a mnemonic!): the three ultimate, basic, and profound elements of computation, (Local) local calculation, (Co) communication, and (SM) structure manipulation. Curiously, once specified, their mere juxtaposition falls into these low-energy crystalline, even molecular, structures which practically and optimally accomplish high-level design intentions. They seem like magic clockworks too complex to understand, except that everything was easy and clear to specify and works together naturally. It's as though you have mysteriously forced me to finally achieve Clear Thinking!
I remember as a high school student learning to think analytically. Being clear about what I am describing means taking it apart into its components and showing how they fit together; then I really know the thing. This approach is quite similar, but in the domain of computation: the way I can be clear about what I am saying is by taking it apart into its components, which are these three, Local-Co-SM, and having expressed those parts clearly, the way to put them back together is so general it might solve every problem. Maybe it does.
Bits and Local-Co-SM are ideas that are, on the one hand, so obvious, and on the other so powerful, and yet again so underappreciated and underutilized, that they deserve emphasis. So let me repeat what you can find here and there about the basic constitutive elements of computation being (Local) local calculation, (Co) communication, and (SM) structure manipulation.
We will look at a detailed example, \(scan \oplus\); hopefully you'll see the theme: these constituents are explicitly separated, simply defined, and put together clearly, enabling a powerful, efficient result. Each has nothing of the others in it. All computation consists of these and nothing more than these. "Local" local calculation is free of communication; everything needed is directly present. "Co" communication is free of calculation and structure manipulation: just send the data itself from sender to recipient, with no calculations in transit and no structure modification either. "SM" structure manipulation, in the sense of dividing and combining vectors, doesn't move data and doesn't calculate anything; it just asserts that you can look at this stick as being divided into two, or those two as being laid end to end.
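To keep that separation concrete before we begin, here is a minimal Python sketch of one function of each kind. This is my own illustration, not Divacon syntax, though the names \(corr\), \(d_{lr}\), and \(c_{lr}\) anticipate the notation used below. Each function does only its own kind of work.

```python
# Local: pure calculation on data already in hand; no communication, no restructuring.
def plus(a, b):
    return a + b

# Co: correspondent communication between two equal-length vectors, written here at
# the whole-vector level (the distributed #!corr of the text). Each element is simply
# paired with its counterpart in the other vector; nothing is computed in transit.
def corr(left, right):
    return ([(l, r) for l, r in zip(left, right)],
            [(r, l) for l, r in zip(left, right)])

# SM: structure manipulation only. View one vector as two halves, or two as one;
# no data values are inspected or changed.
def d_lr(xs):
    mid = len(xs) // 2
    return xs[:mid], xs[mid:]

def c_lr(left, right):
    return left + right
```

These little helpers are reused in the sketches further down.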
Let's begin.
The algorithm says, and I quote (replacing \(id\) with \(f_b\), and commas with periods in certain places):
$$\begin{align}
scan \oplus &= PDC(d_{lr},c_{lr}, id, (h_{scan} \oplus), atom?, f_b)\\
\text{where}\ f_b\ x\ &=\ x.x \\
h_{scan} \oplus &= (!loc_1 \oplus, !loc_2 \oplus)\ :\ \#!corr \\
loc_1 ((x_1.s_1),(x_2.s_2)) &= (x_1 . (s_1\oplus s_2)) \\
loc_2 ((x_1.s_1),(x_2.s_2)) &= ((x_1 \oplus s_2) . (s_1\oplus s_2))
\end{align}$$
Someone might say, Look it up. Or, having read it and understood it, you should not need to look it up.
George would say, if you still don't understand it, at least trust it as an authority that is speaking truth, and do the work needed so you do understand it. Of course it may take some study.
I say, I can't remember where I read it, and I wasn't sure I understood it then, and certainly not at first now; so I'll say it again here, and try to put in a link to the original statement in case of ambiguity. Forgive my redundancy; this discussion repeats ideas found in different sources, mainly GM90.
The general Divacon parallel program can be expressed like this:
$$\begin{align} f = (\ \ p_b\ ? &\ f_b\\ \text{else}\ &\ c : post :\ !\ f : pre : d )\\ \end{align}$$
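Read as a recursive schema, that is just a higher-order function. Here is a hedged Python sketch of how I understand it, with sequential recursion standing in for the parallel map \(!f\), and with \(pre\) and \(post\) acting on the pair of divided halves; the lowercase name pdc is mine, and the argument order follows the \(scan \oplus\) definition above.

```python
def pdc(d, c, pre, post, p_b, f_b):
    """Build f = (p_b ? f_b  else  c : post : !f : pre : d), read right to left."""
    def f(x):
        if p_b(x):
            return f_b(x)                     # base case
        halves = d(x)                         # SM: divide
        halves = pre(halves)                  # preadjust (just id for scan)
        halves = tuple(f(h) for h in halves)  # ! f : solve each half (ideally in parallel)
        halves = post(halves)                 # postadjust (h_scan for scan)
        return c(*halves)                     # SM: combine
    return f
```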
\(d_{lr}\): Split an array into two comprising left and right halves. "Cut a stick in half."
\(c_{lr}\): Join two arrays into one by placing one on the left, the other on the right. "Put two sticks end to end."
\(\oplus\): The operator being scanned, which must be associative. But we do not care whether it is or is not commutative, because all the activity inside this PDC will not be reordering the data sequence, only the operation sequence. So the operator does not need to be commutative. George says he used to raise his hand at conferences when speakers said "associative and commutative" in such a case, and declare the statement incorrect if an algorithm required an associative but not necessarily commutative operator. After some years of being ignored, he stopped raising his hand.
His point being, an algorithm should be stated at the correct level of abstraction, without unnecessary conditions. Adding the restriction of commutativity, therefore, is actually wrong, rather than merely failing to be even more general. You might say, that's a bit pedantic, and maybe so. On the other hand, as I have read, George lost friends in the Cultural Revolution to execution and imprisonment because they said a little too much. In honor of them, let's try to make our words count, then, okay?
George says I make a habit of interpolating unnecessary conditions, so please help me find such errors in these pages, won't you?
\(scan \oplus\) here is a postadjust algorithm, meaning that all the action is in the combining phase, which takes pairs of vectors to combined vectors at each next level. It also means that the basic data in all the vectors will be of the output data type. This is an explicit conversion done by \(f_b\) at the leaves of the division tree, and it must remain true at each level of combining partial-result vector pairs on the way up.
In designing a parallel algorithm, this is the discipline: make sure that the properties you rely upon to do a calculation continue to hold at the next level. That's why preadjust and postadjust functions are needed: not just to do local calculations on communicated and local data, but also to ensure that the outcome for every member is of the same type as at the previous level. Following that discipline means we can reason clearly and logically up and down the tree and be confident that the whole thing will indeed operate as intended and yield the result we want. The property we want to maintain is that each element of each combining vector has the combined vector's final value, namely its scan result (2), along with its local element value (1).
How could this be done? It seems we can't do it at all -- without magic -- since the scan result is only discovered at the end, after doing a whole iterated scan of any vector. But no! There is a magic trick, because every element in each combining vector already has that vector's end-of-vector scan result (2) there with it, and by communicating with the corresponding element of the other vector, it can locally calculate the scan sum which will emerge at the end of the combined vector. The scan result at the end of two joined vectors is the sum of the independent scan results at the end of each! So we sum together the two vector-final scan sums, everywhere in both the combining vectors. Magic!
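As a concrete instance of that claim, using the input of the trace below: for the two halves of \((0\ 1\ 2\ 3\ 4\ 5\ 6\ 7)\),

$$ s_{\text{left}} = 0+1+2+3 = 6, \qquad s_{\text{right}} = 4+5+6+7 = 22, \qquad s_{\text{joined}} = 6 + 22 = 28, $$

and \(28\) is exactly the vector-final scan value (2) that every element of the joined vector ends up carrying.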
Indeed, every element in the combined vector can calculate that same final sum locally, because in the first combining vector every element (1) already has with it its vector's last value (2), and it receives from its corresponding element in the other half the last value of the second half (2). Symmetrically, in the second combining vector each element already has the value summarizing the second combining vector (2), and it receives from its corresponding element in the first combining vector the summary of the first (2). Add them, poof, you have the final scan result for the combined vector.
Calculating these identical sums for each element in both halves of the combined vector is a lot of redundant calculation, but the point is that it requires no further Co communication after the correspondent communication sends just one element across from each to each. (Conveniently, \(\#!corr\) can be done in parallel, without delays or conflicts in transit.) In the result, sure enough, every element in a vector, as mentioned above regarding \(f_b\), is in the output domain, which is a double; one may write it as \((x.s)\), with both (1) the element-local value \(x\) and (2) the vector-final scan result \(s\). Each element thus contains its own scan value (1), and also the vector-final scan value (2), which it locally calculated by making use of the scan value communicated from the other combining vector.
Does this explanation help? I had to draw it out in detail to see how it works, and to be comfortable that the data flows feedforward in a way that is stable, and the claims made at one level remain true at the next.
Remember, all the members of the output type vector get a local calculation; in the case of (2) it yields the same result for each member of both combining vectors, and so all members of the vector contain the combined vector's scan result (2). They don't need to receive more communications about the scan result for the combined vector, because they had enough to calculate it themselves: the scan result for one combining vector, and the communicated scan result from the other combining vector. Summing those two yields the final scan result for the about-to-be-joined vector. Was that magic? Only until I saw it!
One point of mystery to me was, how does this work at one and two levels up from \(f_b\)? Combining two singleton vectors \( [ x_1.s_1 ] \) and \( [ x_2.s_2 ] \), do we capture the whole scan by summing \(s_1 + s_2\)? Obviously we do because this is the normal case of one-by-one scanning.
What about two 2-member vectors \( [ \lt x_1.s_1 \gt, \lt x_2.s_2 \gt ] \) and \( [ \lt x_3.s_3 \gt, \lt x_4.s_4 \gt ] \)? How is it that the final value of the joined vector becomes the \(s\) element everywhere? Well, first observe that already \(s_1 = s_2 = x_2\) and \(s_3 = s_4 = x_4\), because every vector member contains its vector's final value. Second, we can work it out with paper and pencil; it is very simple to see that summing the finals, \(s_2 + s_4\), sums everything. Imagine an element in the first vector were increased by some number \(a\). Wouldn't every element after it, including every element in the second vector, be increased by \(a\)? Yes. So the sum at the end of one vector influences the outcomes across the second vector, and by exactly that sum. We do capture the whole scan result of both vectors together by just summing \(s_2 + s_4\), because each final contains its whole vector's result so far. When we add \(s_2\) to all the elements in the second vector (which we do, since in one case we add \(s_1 = s_2\) to the first element of the second vector and in the other case we add \(s_2\) to the second element of the second vector), we distribute its effect across the whole second vector. And vice versa, from the second vector to the first; so now each element in both vectors has the final sum of the joined vector. Doing this repeatedly at every level maintains the proposition that each element contains its vector's final value even as we build up larger and larger vectors.
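We can also check this against the trace below: the left pair of 2-member vectors there is \([0.1,\ 1.1]\) and \([2.5,\ 5.5]\), so \(s_2 = 1\), \(s_4 = 5\), and \(s_2 + s_4 = 1 + 5 = 6\), which is exactly the value that appears in position (2) of every element of their combined vector \([0.6,\ 1.6,\ 3.6,\ 6.6]\).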
Now we are really ready to define "Local", those local calculations, of which there are two types, \(loc_1\) and \(loc_2\), one for each side of the bidirectional communication. Both sides add their own and the other fellow's temporary scan results and store the sum (separately calculated, but identical in value) into their own scan-result spot (2). One side is considered the low side, so it keeps its local value (1) as is, and it is done. The other side, the high side, because it is scanning on top of the previous sums, also adds the low side's scan result (2) into its local value (1); that is its new local scan value (1).
That is the idea behind the \(loc_1\) and \(loc_2\) functions: both put the identical sum into position (2); on the low side \(loc_1\) just leaves \(x_1\) in position (1), while on the high side \(loc_2\) actually performs the next-level, incremental, full-vector-pair scan operation by adding the other side's final-result-so-far into its own local value (1).
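In the running Python sketch, with ordinary \(+\) standing in for \(\oplus\) and each element written as a pair \((x, s)\) instead of \(x.s\), those two local functions come out as:

```python
# The argument is what #!corr delivered: (own element, element received from the
# corresponding position in the other combining vector), both of the form (x, s).
def loc_1(pair):             # low side: keep x, record the combined total as s
    (x1, s1), (x2, s2) = pair
    return (x1, s1 + s2)

def loc_2(pair):             # high side: also fold the low side's total into x
    (x1, s1), (x2, s2) = pair
    return (x1 + s2, s1 + s2)
```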
So we put all this extra structure around the scan operation: the division and the combination, the base predicate, the utmost-simple base function into the output double domain, and the postadjust which updates the partial-vector final scan values locally at every element. The actually-sequential scan operation is reduced to one bit inside \(h_{scan}\), namely "\(x_1 \oplus s_2\)", and even that is a recursive, subdivided, summarizing operation, so that really, everything is summaries moving around, accumulating half-vectors at a time into full vectors.
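In the same sketch, the whole postadjust \(h_{scan} = (!loc_1 \oplus, !loc_2 \oplus) : \#!corr\) is one correspondent exchange followed by two mapped local functions; again this is my rendering, not Divacon itself.

```python
def h_scan(halves):
    left, right = halves
    left_pairs, right_pairs = corr(left, right)   # Co: one exchange, each to each
    return ([loc_1(p) for p in left_pairs],       # ! loc_1 over the low half
            [loc_2(p) for p in right_pairs])      # ! loc_2 over the high half
```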
You can skip the following detailed, annotated Divacon computation graph if you feel you understand this. I really struggled to understand it, even though afterward it doesn't seem so complicated, so I drew out each step to see it. Let's follow GM90 Figure 4.7 and calculate scan + (0 1 2 3 4 5 6 7):
[ 0 1 2 3 4 5 6 7 ] | \(f = scan \oplus\) on this input first calls \(d_{lr}\) #1: | |||||||
[ 0 1 2 3 ] | [ 4 5 6 7 ] | Then \(!scan \oplus\) applies, thus 2x of | ||||||
[ 0 1 ] | [ 2 3 ] | [ 4 5 ] | [ 6 7 ] | \(d_{lr}\) #2, & of \(! scan \oplus\) again, thus 2 each more of: | ||||
[ 0 ] | [ 1 ] | [ 2 ] | [ 3 ] | [ 4 ] | [ 5 ] | [ 6 ] | [ 7 ] | \(d_{lr}\) #3. Now \(! f\) checks \(atom?\): true. |
[ 0.0 ] | [ 1.1 ] | [ 2.2 ] | [ 3.3 ] | [ 4.4 ] | [ 5.5 ] | [ 6.6 ] | [ 7.7 ] | Apply \(f_b\) to singleton vectors |
[ (0.0, 1.1) ] | [ (1.1, 0.0) ] | [ (2.2, 3.3) ] | [ (3.3, 2.2) ] | [ (4.4, 5.5) ] | [ (5.5, 4.4) ] | [ (6.6, 7.7) ] | [ (7.7, 6.6) ] | \(\#!corr\) takes adjacent pairs of vectors and communicates each to each
\([ (loc_1 +)(0.0, 1.1) ] \) | \([ (loc_2 +)(1.1, 0.0) ]\) | \([ (loc_1 +)(2.2, 3.3) ] \) | \([ (loc_2 +)(3.3, 2.2) ]\) | \([ (loc_1 +)(4.4, 5.5) ] \) | \([ (loc_2 +)(5.5, 4.4) ]\) | \([ (loc_1 +)(6.6, 7.7) ] \) | \([ (loc_2 +)(7.7, 6.6) ]\) | distribute \(loc_1 +\) and \(loc_2 +\), one to each vector of each adjacent pair.
\( [ 0.(0+1) ] \) | \([ (1+0).(1+0)]\) | \( [ 2.(2+3) ] \) | \([ (3+2).(3+2)]\) | \( [ 4.(4+5) ] \) | \([ (5+4).(5+4)]\) | \( [ 6.(6+7) ] \) | \([ (7+6).(7+6)]\) | Local calculations |
\([ 0.1 ]\) | \([ 1.1 ]\) | \([ 2.5 ]\) | \([ 5.5 ]\) | \([ 4.9 ]\) | \([ 9.9 ]\) | \([ 6.13 ]\) | \([ 13.13 ]\) | Local calculations
\([ 0.1, 1.1 ]\) | \([ 2.5, 5.5 ]\) | \([ 4.9, 9.9 ]\) | \([ 6.13, 13.13 ]\) | \(c_{lr}\) #3. Level #3 is done. | ||||
\([ (0.1, 2.5), (1.1, 5.5) ]\) | \( [ (2.5, 0.1), (5.5, 1.1) ]\) | \([ (4.9, 6.13), (9.9, 13.13) ]\) | \( [ (6.13, 4.9), (13.13,9.9) ]\) | \(post\) starts with \(\#!corr\) communication between array pairs | ||||
\((!loc_1 +)[ (0.1, 2.5), (1.1, 5.5) ]\) | \( (!loc_2 + )[ (2.5, 0.1), (5.5, 1.1) ]\) | \((!loc_1 +)[ (4.9, 6.13), (9.9, 13.13) ]\) | \( (!loc_2 +)[ (6.13, 4.9), (13.13,9.9) ]\) | \((!loc_1+, !loc_2+)\) distributes each \(loc\) function over one of a pair of adjacent vectors. | ||||
\([ (loc_1 +)(0.1, 2.5), (loc_1 +)(1.1, 5.5) ]\) | \( [ (loc_2 +)(2.5, 0.1), (loc_2 +)(5.5, 1.1) ]\) | \([ (loc_1 +)(4.9, 6.13), (loc_1 +)(9.9, 13.13) ]\) | \( [ (loc_2 +)(6.13, 4.9), (loc_2 +)(13.13,9.9) ]\) | here they are seen distributed. | ||||
\([ 0.(1+5), 1.(1+5) ]\) | \( [ (2+1).(5+1), (5+1).(5+1) ]\) | \([ 4.(9+13), 9.(9+13) ]\) | \( [ (6+9).(13+9), (13+9).(13+9) ]\) | Apply the formulas | ||||
\([ 0.6, 1.6 ]\) | \( [ 3.6, 6.6 ]\) | \([ 4.22, 9.22 ]\) | \( [ 15.22, 22.22 ]\) | Do the arithmetic | ||||
\([ 0.6, 1.6, 3.6, 6.6 ]\) | \( [ 4.22, 9.22, 15.22, 22.22 ]\) | \(c_{lr}\) #2. Level #2 is done. One more \(post\) and \(c_{lr}\). | ||||||
\([ (0.6,4.22), (1.6,9.22), (3.6,15.22), (6.6,22.22) ]\) | \( [ (4.22,0.6), (9.22, 1.6), (15.22, 3.6), (22.22, 6.6) ]\) | \(\#!corr\) between 4-element arrays | ||||||
\((!loc_1+)[ (0.6,4.22), (1.6,9.22), (3.6,15.22), (6.6,22.22) ]\) | \((!loc_2+) [ (4.22,0.6), (9.22, 1.6), (15.22, 3.6), (22.22, 6.6) ]\) | Apply \(!loc_1+\) to the first, \(!loc_2+\) to the second | ||||||
\([ (loc_1+)(0.6,4.22), (loc_1+)(1.6,9.22), (loc_1+)(3.6,15.22), (loc_1+)(6.6,22.22) ]\) | \( [ (loc_2+)(4.22,0.6), (loc_2+)(9.22, 1.6), (loc_2+)(15.22, 3.6), (loc_2+)(22.22, 6.6) ]\) | Distributed | ||||||
\([ 0.28, 1.28, 3.28, 6.28 ]\) | \( [ 10.28, 15.28, 21.28, 28.28 ]\) | Carry out the local calculations | ||||||
\([ 0.28, 1.28, 3.28, 6.28, 10.28, 15.28, 21.28, 28.28 ]\) | \(c_{lr}\) #1. Done! | |||||||
\(28\) | You might want to pull out the array of values (1), or only the last value; whatever you like.
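If you would like to replay that trace mechanically, here is the whole thing assembled from the Python sketches above; it is a hedged rendering of mine, not Divacon itself, but it reproduces the bottom rows of the table.

```python
def atom(xs):           # base predicate: is this a singleton vector?
    return len(xs) == 1

def f_b(xs):            # base function: x becomes x.x, written here as (x, x)
    x = xs[0]
    return [(x, x)]

scan_plus = pdc(d_lr, c_lr, lambda halves: halves, h_scan, atom, f_b)

print(scan_plus([0, 1, 2, 3, 4, 5, 6, 7]))
# [(0, 28), (1, 28), (3, 28), (6, 28), (10, 28), (15, 28), (21, 28), (28, 28)]
```

The first components are the running sums \(0, 1, 3, 6, 10, 15, 21, 28\), and the second component of every element is the full sum \(28\), just as in the last rows of the table.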
Did you notice how the function stack, built up by the time of \(f_b\), is a lot of promises carried out in the rest of the process? Good.
In fact, in Divacon, communication is a first-class part of the language; as far as I know, in no other high-level computer language is this the case. Given that computation is all and only Local-Co-SM, that is Local calculation, Co communication, and SM structure manipulation, the three orthogonal components of computation, it is amazing that no one else has surfaced Co, communication, which is left always to compilers and interpreters and goes unexpressed in statements of algorithms. When communication is made explicit in the Divacon language, calculating its costs and optimizing its operation become possible. Not only is it difficult to think about what one never expresses, but when the compiler and the chip designer have no hint about how to do communication within an algorithm, the most cumbersome, turn-taking, blocking, all-to-all, general-but-unbelievably-slow methods get used. Whereas we have seen here in Algorithm 4.5 communications which are correspondent (each element exchanging with exactly one partner), carried out in parallel, and free of delays or conflicts in transit.
Thus communications can be made very low in time cost.
Considering that an interprocessor communication can cost 10x or 1000x the time of a local calculation, not to mention the resulting burdens and costs and complexities of multi-layered cache memory architectures, we really have to pay attention to computational intensity: the ratio between local calculations and communications. Computational intensity must be thought about and optimized, and the main solution is to have a communication model that can recursively decompose \(O(n)\) expensive (and, OMG, serial!) communications into \(O(\log n)\) cheap parallel communications, like \(\#!corr\).
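To put rough numbers on that for this example (my own arithmetic, just reading off the structure above): \(scan \oplus\) performs one round of \(\#!corr\) per combining level, so \(\log_2 n\) communication rounds in all, with every message in a round traveling in parallel; for \(n = 8\) that is 3 rounds, versus the \(O(n)\) sequential communications of a scheme that funnels partial results through one place. Each round is paid for with only one or two local additions per element, so the ratio of local calculation to communication stays constant at every level.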
Thank you for your attention, and any comments or corrections are appreciated.
Z. George Mou and Xiaojing Wang, "Optimal Mappings of m Dimensional FFT Communication to k Dimensional Mesh for Arbitrary m and k," in PARLE '93: Parallel Architectures and Languages Europe (5th International PARLE Conference, Munich, Germany, June 1993, Proceedings), Lecture Notes in Computer Science 694, Springer-Verlag, pp. 104-119.