Algorithm 4.5: Scan without Broadcast

demonstrating broadcast dissolving, which reduces communication from O(n) to O(log(n))

On Mou
on Divide and Conquer

Introduction

I told George: I'm not that smart, but I'm somewhat determined, and so I'm getting to understand you a little. After some study I can report that I'm glad I had your personal conversation to guide me, because in the writing the lessons come in contemplating the result at the end. So please spend more time talking about the basic basics, the tautological foundations, and how they inform the strategic direction.

See how far you can push the idea of a string of bits, which powers the Black White Array and the Dimensional Shuffle Transform patents, as well as the (Communication-) Optimal Mapping paper. Or, similarly, Local-Co-SM (there's a mnemonic!): these are the three ultimate, basic, and profound elements of computation: (Local) local calculation, (Co) communication, and (SM) structure manipulation. Curiously, once specified, their mere juxtaposition falls into a sort of low-energy crystalline or even molecular structure which practically and optimally accomplishes high-level design intentions -- structures which seem like magic clockworks too complex to understand, except that everything was easy and clear to specify and works together naturally. It's as though you have mysteriously forced me to finally achieve Clear Thinking!

I remember as a high school student learning to think analytically. Being clear about what I am describing means taking it apart into its components and showing how they fit together. Then I really know the thing. This approach is quite similar, but in the domain of computation: the WAY I can be clear about what I am saying is by taking it apart into its components, which are these three, Local-Co-SM; and having expressed those parts clearly, the way to put them back together is so general it might solve every problem. Maybe it does.

Bits and Local-Co-SM are ideas that are, on the one hand, so obvious, and on the other so powerful, and yet again so underappreciated and underutilized, that they deserve emphasis. So let me repeat what you can find here and there about the basic constitutive elements of computation being (Local) local calculation, (Co) communication, and (SM) structure manipulation.

  • Each has nothing of the others in it.
  • All computation consists of these and nothing more than these.
  • "Local" local calculation is free of communication; everything needed is directly present.
  • "Co" communication is free of calculation or structure manipulation: just send the data itself from sender to recipient, no calculations in transit, no structure modification either.
  • "SM" structure manipulation in the sense of dividing and combining vectors doesn't move data, doesn't calculate anything, just asserts that you can look at this stick as being divided into two, or those two as being laid end to end.
  • We will look at a detailed example, \(scan \oplus\); hopefully you'll see the theme: these constituents are explicitly separated, simply defined, and put together clearly, enabling a powerful, efficient result.

    Let's begin.

    The Algorithm

    Algorithm 4.5 of George Mou's 1990 dissertation (henceforth "GM90") took me some work to get through. Here I'm going to go through it again, saying neither more nor less than the original, but giving more steps and detail so you may be able to follow it more easily than I did my first time through.

    The algorithm says, and I quote (replacing \(id\) with \(f_b\), and "," with "." in certain cases):

    $$\begin{align} scan \oplus &= PDC(d_{lr}, c_{lr}, id, (h_{scan} \oplus), atom?, f_b)\\ \text{where}\quad f_b\ x &= x.x \\ h_{scan} &= (!loc_1 \oplus, !loc_2 \oplus)\ :\ \#\ !\ corr \\ loc_1 ((x_1.s_1),(x_2.s_2)) &= (x_1 . (s_1 \oplus s_2)) \\ loc_2 ((x_1.s_1),(x_2.s_2)) &= ((x_1 \oplus s_2) . (s_1 \oplus s_2)) \end{align}$$
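
    For readers who find code easier to parse than the notation, here is a minimal Python transcription of the pieces above -- my own sketch, not GM90's code -- assuming \(\oplus\) is ordinary addition and writing the dotted pair \(x.s\) as a Python tuple (x, s):

        # f_b: lift a raw value x into the output domain x.s, where x is the
        # element's scan-so-far and s is the vector-final scan result.
        # For a singleton vector both are just x.
        def f_b(x):
            return (x, x)

        # loc_1: local calculation for the low (left) side. `own` is this
        # element's pair (x1, s1); `received` is the pair (x2, s2) obtained by
        # corr from the corresponding element of the other vector.
        def loc_1(own, received):
            (x1, s1), (x2, s2) = own, received
            return (x1, s1 + s2)          # keep x1, update the vector-final scan

        # loc_2: local calculation for the high (right) side: also fold the
        # other side's vector-final scan s2 into the local scan value x1.
        def loc_2(own, received):
            (x1, s1), (x2, s2) = own, received
            return (x1 + s2, s1 + s2)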

    Definitions

    This algorithm requires quite a few definitions to make sense of it.

    Someone might say, Look it up. Or, having read it and understood it, you should not need to look it up.

    George would say, if you still don't understand it, at least trust it as an authority that is speaking truth, and do the work needed so you do understand it. Of course it may take some study.

    I say: I can't remember where I read it, and I wasn't sure I understood it then, and certainly am not sure now, so I'll say it again here and try to put in a link to the original statement in case of ambiguity. Forgive my redundancy; this discussion repeats ideas found in different sources, mainly GM90.

    Discussion

    Now we are almost at the essence of the problem. Let me blather for a bit, and then maybe you can just read the original statement and it'll suddenly click. Okay?

    \(scan \oplus\) here is a postadjust algorithm, meaning that all the action is in the combining phase which takes pairs of vectors to combined vectors at each next level. Also, it means that the basic data in all the vectors will be the output data type. This is an explicit conversion done by \(f_b\) at the leaves of the division tree. But it also must remain so at each level of combining partial result vector pairs on the way up.

    In designing a parallel algorithm, this is the discipline: make sure that the properties you rely upon to do a calculation continue to hold at the next level. That's why preadjust and postadjust functions are needed, not just to do local calculations on communicated and local data, but also to ensure that the outcome for every member is of the same type as at the previous level. Following that discipline means we can then reason clearly and logically up and down the tree and feel confident in conclusions that the whole thing will indeed operate as intended and yield the result we want.

    The property we want to maintain is that each element of each vector carries that vector's final value, namely its scan result (2), along with its own local scan value (1) -- and that after combining, each element of the combined vector carries the combined vector's final value.

    How could this be done? It seems we can't do it at all -- without magic -- since the scan result is only discovered at the end, after doing a whole iterated scan of any vector. But no! There is a magic trick: every element in each combining vector already has that vector's end-of-vector scan result (2) with it, and by communicating with the corresponding element of the other vector, it can locally calculate the scan sum which will emerge at the end of the combined vector. The scan result at the end of two joined vectors is the sum of the independent scan results at the end of each! So we sum together the two vector-final scan sums, everywhere in both combining vectors. Magic!
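
    In symbols, and with the numbers from the worked trace further below (left half \(0\ 1\ 2\ 3\), right half \(4\ 5\ 6\ 7\)):

    $$ s_{combined} = s_{left} \oplus s_{right}, \qquad \text{e.g.}\quad 28 = 6 + 22 $$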

    Indeed, every element in the combined vector can calculate that same final sum value locally, because in the first combining vector every element (1) already has with it its vector's last value (2), and it receives from its corresponding element in the other half the last value of the second half (2). Symmetrically, in the second combining vector each element already has the value summarizing the second combining vector (2), and it receives from its corresponding element in the first combining vector its summary of the first (2). Add them, poof, you have the final scan result for the combined vector.

    Calculating these identical sums for each element of both combining vectors is a lot of redundant calculation, but the point is that it requires no further Co communication after the correspondent communication sends just one element across from each to each. (Conveniently, \(\#!corr\) can be done in parallel without delays or conflicts in transit.) In the result, sure enough, every element in a vector, as mentioned above regarding \(f_b\), is in the output domain, which is a double one may write as \((x.s)\): both (1) the element-local value \(x\), and (2) the vector-final scan result \(s\). Each element thus contains its own scan value (1), and also the vector-final scan value (2), which it locally calculated by making use of the scan value communicated from the other combining vector.
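
    As a concrete sketch of one combining step (again my own Python rendering, with \(\oplus = +\) and pairs as tuples), note that the correspondent communication hands each element exactly one pair from the other vector, after which every element finishes locally:

        # One combining step for two equal-length partial-result vectors whose
        # elements are pairs (x, s): x = scan-so-far, s = vector-final scan.
        def combine_step(left, right):
            # corr: element i of each half exchanges its pair with element i of
            # the other half -- one send and one receive per element, no broadcast.
            new_left  = [(x1, s1 + s2)      for (x1, s1), (x2, s2) in zip(left, right)]
            new_right = [(x1 + s2, s1 + s2) for (x1, s1), (x2, s2) in zip(right, left)]
            # c_lr: lay the two halves end to end.
            return new_left + new_right

        # Matches the trace below, e.g. combining [0.1, 1.1] with [2.5, 5.5]:
        print(combine_step([(0, 1), (1, 1)], [(2, 5), (5, 5)]))
        # -> [(0, 6), (1, 6), (3, 6), (6, 6)]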

    Does this explanation help? I had to draw it out in detail to see how it works, and to be comfortable that the data flows feedforward in a way that is stable, and the claims made at one level remain true at the next.

    Remember, all the members of the output type vector get a local calculation; in the case of (2) it yields the same result for each member of both combining vectors, and so all members of the vector contain the combined vector's scan result (2). They don't need to receive more communications about the scan result for the combined vector, because they had enough to calculate it themselves: the scan result for one combining vector, and the communicated scan result from the other combining vector. Summing those two yields the final scan result for the about-to-be-joined vector. Was that magic? Only until I saw it!

    One point of mystery to me was, how does this work at one and two levels up from \(f_b\)? Combining two singleton vectors \( [ x_1.s_1 ] \) and \( [ x_2.s_2 ] \), do we capture the whole scan by summing \(s_1 + s_2\)? Obviously we do because this is the normal case of one-by-one scanning.

    What about two 2-member vectors \( [ \lt x_1.s_1 \gt, \lt x_2.s_2 \gt ] \) and \( [ \lt x_3.s_3 \gt, \lt x_4.s_4 \gt ] \)? How is it that the final of the second vector becomes the \(s\) element everywhere? Well, first observe that already \(s_1 = s_2 = x_2\) and \(s_3 = s_4 = x_4\), because every vector member contains the vector's final value. Second, we can work it out with paper and pencil -- it's very simple to see -- that summing the finals \(s_2 + s_4\) sums everything. Imagine an element in the first vector were increased by some number \(a\). Wouldn't every element after it, including every element in the second vector, be increased by \(a\)? Yes. So the sum at the end of one vector influences the outcomes across the second vector. By how much? By that sum. We do capture the whole scan result of both vectors together by just summing \(s_2 + s_4\), because each final contains its whole vector's result so far. When we add \(s_2\) to all the elements in the second vector (which we do, since in one case we add \(s_1 = s_2\) to the first element of the second vector and in the other case we add \(s_2\) to the second element of the second vector) we distribute its effect across the whole second vector. And vice versa, from the second vector to the first, so now each element in both vectors has the final sum of the joined vector. Doing this repeatedly at every level maintains the proposition that each element contains its vector's final value even as we build up larger levels.

    Now we are really ready to define "Local", those local calculations, of which there are two types: \(loc_1\) and \(loc_2\). There are two sides of the communication, which is bidirectional. Both sides add their own temporary scan result to the other fellow's and store the sum (separately calculated, but identical in value) into their own scan result spot (2). One side is considered the low side, so it keeps its low value (1), all done. The other side, the high side, because it is scanning on top of the previous sums, will add the low side's scan result (2) into its local value (1); that is its new local scan value (1).

    That is the idea behind the \(loc_1\) and \(loc_2\) functions, which put identical sums into position (2), and on the low side just leave \(x_1\) there, but on the high side actually perform the next-level, incremental, full-vector-pair scan operation by adding the final-results-so-far from the other side to itself.

    So we put all this extra structure around the scan operation: the division and combination, the base predicate, and the utmost-simple base function into the output double domain. Then we postadjust the results by updating the partial vectors' final scan values locally at every element, reducing the actually-sequential scan operation to one bit inside \(h_{scan}\), namely "\(x_1 \oplus s_2\)" -- and even that is a recursive and subdivided summarizing operation, so that really everything is summaries moving around, accumulating half-vectors at a time into full vectors.
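
    Putting the pieces together, here is a minimal recursive sketch of the whole algorithm -- my own stand-in for the PDC combinator with \(d_{lr}\), \(c_{lr}\), \(atom?\), \(f_b\) and the postadjust \(h_{scan}\), again with \(\oplus = +\). A real Divacon implementation would of course run the two recursive calls, and all the local steps within a level, in parallel:

        def scan_plus(v):
            # atom?: a singleton vector is the base case; f_b x = x.x
            if len(v) == 1:
                return [(v[0], v[0])]
            # d_lr: divide into left and right halves (length a power of two)
            mid = len(v) // 2
            left, right = scan_plus(v[:mid]), scan_plus(v[mid:])
            # postadjust h_scan = (!loc_1, !loc_2) : # ! corr --
            # corr exchanges corresponding pairs, loc_1 / loc_2 finish locally
            new_left  = [(x1, s1 + s2)      for (x1, s1), (_, s2) in zip(left, right)]
            new_right = [(x2 + s1, s2 + s1) for (_, s1), (x2, s2) in zip(left, right)]
            # c_lr: combine the two halves end to end
            return new_left + new_right

        print(scan_plus([0, 1, 2, 3, 4, 5, 6, 7]))
        # -> [(0,28), (1,28), (3,28), (6,28), (10,28), (15,28), (21,28), (28,28)]
        # i.e. the last full vector of the trace below: 0.28, 1.28, 3.28, ..., 28.28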

    You can skip the following detailed, annotated Divacon computation graph if you feel you understand this. I really struggled to understand it, even though afterward it doesn't seem so complicated, so I did this to see each step. Let's follow GM90 Figure 4.7 and calculate \(scan + (0\ 1\ 2\ 3\ 4\ 5\ 6\ 7)\):

    [ 0 1 2 3 4 5 6 7 ] \(f = scan \oplus\) on this input first calls \(d_{lr}\) #1:
    [ 0 1 2 3 ] [ 4 5 6 7 ] Then \(!scan \oplus\) applies, thus 2x of
    [ 0 1 ] [ 2 3 ] [ 4 5 ] [ 6 7 ] \(d_{lr}\) #2, and \(! scan \oplus\) again, thus 2 more each of:
    [ 0 ] [ 1 ] [ 2 ] [ 3 ] [ 4 ] [ 5 ] [ 6 ] [ 7 ] \(d_{lr}\) #3. Now \(! f\) checks \(atom?\): true.
    [ 0.0 ] [ 1.1 ] [ 2.2 ] [ 3.3 ] [ 4.4 ] [ 5.5 ] [ 6.6 ] [ 7.7 ] Apply \(f_b\) to singleton vectors
    [ (0.0, 1.1) ] [ (1.1, 0.0) ] [ (2.2, 3.3) ] [ (3.3, 2.2) ] [ (4.4, 5.5) ] [ (5.5, 4.4) ] [ (6.6, 7.7) ] [ (7.7, 6.6) ] \(\#!corr\) takes adjacent pairs of vectors and communicates each to each
    \([ (loc_1 +)(0.0, 1.1) ] \)\([ (loc_2 +)(1.1, 0.0) ]\) \([ (loc_1 +)(2.2, 3.3) ] \)\([ (loc_2 +)(3.3, 2.2) ]\) \([ (loc_1 +)(4.4, 5.5) ] \)\([ (loc_2 +)(5.5, 4.4) ]\) \([ (loc_1 +)(6.6, 7.7) ] \)\([ (loc_2 +)(7.7, 6.6) ]\) distribute one \(loc +\) function per vector, i.e. per four numbers.
    \( [ 0.(0+1) ] \)\([ (1+0).(1+0)]\) \( [ 2.(2+3) ] \)\([ (3+2).(3+2)]\) \( [ 4.(4+5) ] \)\([ (5+4).(5+4)]\) \( [ 6.(6+7) ] \)\([ (7+6).(7+6)]\) Local calculations
    \([ 0.1 ]\)\([ 1.1 ]\) \([ 2.5 ]\)\([ 5.5 ]\) \([ 4.9 ]\)\([ 9.9 ]\) \([ 6.13 ]\)\([ 13.13 ]\) Local calculations
    \([ 0.1, 1.1 ]\)\([ 2.5, 5.5 ]\) \([ 4.9, 9.9 ]\)\([ 6.13, 13.13 ]\) \(c_{lr}\) #3. Level #3 is done.
    \([ (0.1, 2.5), (1.1, 5.5) ]\)\( [ (2.5, 0.1), (5.5, 1.1) ]\) \([ (4.9, 6.13), (9.9, 13.13) ]\)\( [ (6.13, 4.9), (13.13,9.9) ]\) \(post\) starts with \(\#!corr\) communication between array pairs
    \((!loc_1 +)[ (0.1, 2.5), (1.1, 5.5) ]\)\( (!loc_2 + )[ (2.5, 0.1), (5.5, 1.1) ]\) \((!loc_1 +)[ (4.9, 6.13), (9.9, 13.13) ]\)\( (!loc_2 +)[ (6.13, 4.9), (13.13,9.9) ]\) \((!loc_1+, !loc_2+)\) distributes each \(loc\) function over one of a pair of adjacent vectors.
    \([ (loc_1 +)(0.1, 2.5), (loc_1 +)(1.1, 5.5) ]\)\( [ (loc_2 +)(2.5, 0.1), (loc_2 +)(5.5, 1.1) ]\) \([ (loc_1 +)(4.9, 6.13), (loc_1 +)(9.9, 13.13) ]\)\( [ (loc_2 +)(6.13, 4.9), (loc_2 +)(13.13,9.9) ]\) here they are seen distributed.
    \([ 0.(1+5), 1.(1+5) ]\)\( [ (2+1).(5+1), (5+1).(5+1) ]\) \([ 4.(9+13), 9.(9+13) ]\)\( [ (6+9).(13+9), (13+9).(13+9) ]\) Apply the formulas
    \([ 0.6, 1.6 ]\)\( [ 3.6, 6.6 ]\) \([ 4.22, 9.22 ]\)\( [ 15.22, 22.22 ]\) Do the arithmetic
    \([ 0.6, 1.6, 3.6, 6.6 ]\)\( [ 4.22, 9.22, 15.22, 22.22 ]\) \(c_{lr}\) #2. Level #2 is done. One more \(post\) and \(c_{lr}\).
    \([ (0.6,4.22), (1.6,9.22), (3.6,15.22), (6.6,22.22) ]\)\( [ (4.22,0.6), (9.22, 1.6), (15.22, 3.6), (22.22, 6.6) ]\) \(\#!corr\) between 4-element arrays
    \((!loc_1+)[ (0.6,4.22), (1.6,9.22), (3.6,15.22), (6.6,22.22) ]\)\((!loc_2+) [ (4.22,0.6), (9.22, 1.6), (15.22, 3.6), (22.22, 6.6) ]\) Apply \(!loc_1+\) to the first, \(!loc_2+\) to the second
    \([ (loc_1+)(0.6,4.22), (loc_1+)(1.6,9.22), (loc_1+)(3.6,15.22), (loc_1+)(6.6,22.22) ]\)\( [ (loc_2+)(4.22,0.6), (loc_2+)(9.22, 1.6), (loc_2+)(15.22, 3.6), (loc_2+)(22.22, 6.6) ]\) Distributed
    \([ 0.28, 1.28, 3.28, 6.28 ]\)\( [ 10.28, 15.28, 21.28, 28.28 ]\) Carry out the local calculations
    \([ 0.28, 1.28, 3.28, 6.28, 10.28, 15.28, 21.28, 28.28 ]\) \(c_{lr}\) #1. Done!
    \(28\) Finally you might want to pull out the array of values (1), or only the last scan value; whatever you like.
    The output is a vector of dotted elements, not a single number; so that bit of bookkeeping seems to be a detail left outside of the statement of the algorithm.

    Did you notice how the function stack, built up by the time of \(f_b\), is a lot of promises carried out in the rest of the process? Good.

    Communication without Broadcast

    Did you notice that \(corr\) and \((f,f)\) apply to pairs of vectors? Division doesn't mean that calculations cannot receive communications from other vectors. It's not that you keep the vectors nearby, but that conceptually you know which vectors pair together and that you can communicate between them, to get their possibly remote data into the local vector. We let communication handle a lot here.

    In fact, in Divacon, communication is a first-class part of the language. In virtually no other high-level computer language is this the case. Given that computation is all and only Local-Co-SM -- local calculation, communication, and structure manipulation, the three orthogonal components of computation -- it is amazing that no one has surfaced Co, communication, leaving it always to compilers and interpreters, unexpressed in statements of algorithms. When communication is made explicit in the Divacon language, calculating its costs and optimizing its operation become possible. Not only is it difficult to think about what one never expresses, but when the compiler and chip designer have no hint about how to do communication within an algorithm, the most cumbersome, turn-taking, blocking, all-to-all, general-but-unbelievably-slow methods are used. Whereas we have seen here in Algorithm 4.5 communications which are:

      • correspondent: each element sends to and receives from exactly one partner element;
      • parallel: all the pairwise exchanges of a phase happen simultaneously, without delays or conflicts in transit;
      • few: only log(n) such phases are needed in all.

    Thus communications can be made very low in time cost.

    Considering that interprocessor communications can be 10x or 1000x the time cost of a local calculation, not to mention the resulting burdens and costs and complexities of multi-layered cache memory architectures, we really have to pay attention to computational intensity: the ratio between local calculations and communications. Computational intensity must be thought about and optimized, and the main solution to that is to have a communication model that can recursively decompose \(O(n)\) expensive (and, OMG, serial!) communications into \(O(log(n))\) cheap parallel communications, like \(\#!corr\).
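
    A back-of-envelope illustration (my own, counting phases only and ignoring constant factors): a strictly sequential scan needs n-1 dependent communications, while the \(\#!corr\) scheme needs one parallel exchange phase per level of the division tree:

        import math

        # communication phases: serial left-to-right scan vs. corr-based scan
        for n in (8, 1024, 1_000_000):
            serial_phases   = n - 1                    # one dependent send per element
            parallel_phases = math.ceil(math.log2(n))  # one corr exchange per level
            print(n, serial_phases, parallel_phases)
        # 8 -> 7 vs 3,  1024 -> 1023 vs 10,  1000000 -> 999999 vs 20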

    Broadcast Dissolving

    George points out that Algorithm 4.5 exemplifies "Broadcast Dissolving, which in fact can be applied to many other problems." If I may be more specific, Broadcast Dissolving means log(n) parallel correspondent communication phases, whole vector to whole vector, all element-pairwise in parallel, each followed by local calculations that yield an array-wide shared value separately calculated and stored with each member of each array. For example, notice (1+5), (1+5), (5+1), (5+1) just above \(c_{lr}\) #2 in the above example, each yielding 6. Thus every element arrives at a shared outcome without ever actually communicating 1-to-many. Broadcast, the slowest communication pattern, is replaced by correspondent communication, which can be done all in one step at each of log(n) levels, without interference or mutual blocking. Since communication time is much greater than local calculation time, and since the redundant local calculations are done in parallel -- simultaneously -- we vastly reduce algorithm time cost through Broadcast Dissolving.
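
    Here is Broadcast Dissolving in miniature -- my own illustration, not GM90's code, with + as the operator: every element ends up holding the same vector total without any one-to-many send, using only log(n) correspondent exchange phases and purely local additions:

        # Give every element the total of the vector without broadcasting it.
        def dissolved_total(values):
            totals = list(values)        # each element starts with just its own value
            n = len(totals)              # assumed to be a power of two
            span = 1
            while span < n:
                # one parallel phase: element i exchanges partial totals with its
                # correspondent i ^ span in the other half of its current block,
                # then adds the received partial total to its own -- locally.
                totals = [totals[i] + totals[i ^ span] for i in range(n)]
                span *= 2
            return totals                # every element now holds the same total

        print(dissolved_total([0, 1, 2, 3, 4, 5, 6, 7]))
        # -> [28, 28, 28, 28, 28, 28, 28, 28], with 3 phases instead of a broadcast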

    Thank you for your attention, and any comments or corrections are appreciated.

    Bibliography

    Zhijing George Mou, A Formal Model for Divide-and-Conquer and Its Parallel Realization, May 1990, Research Report YALEU/DCS/RR-795.

    Z. George Mou and Xiaojing Wang, Optimal Mappings of m Dimensional FFT Communication to k Dimensional Mesh for Arbitrary m and k. pp 104-119. in PARLE '93, Parallel Architectures and Languages Europe, Lecture Notes in Computer Science #694, Springer-Verlag. 5th International PARLE Conference, Munich, Germany, June 1993 Proceedings.  

     

    Copyright © 2024 Thomas C. Veatch. All rights reserved.
    Created: July 22, 2024