Scan without Broadcast

Parallel Divide and Conquer gets interesting

On Mou, on Divide and Conquer
Scan with broadcast depends on \(O(N)\) communication delays at each step, to carry out the broadcast to \(N\) recipients. These delays might be dissolved into the \(O(\log N)\) steps of a Divacon program, if we could locally calculate, within each side of a divided array pair H0 and H1, the value of "last 0", which would otherwise require \(O(N)\)-time communication. Can we send to all of H0 the last of H0, and to all of H1 the last of H1, by parallel calculation instead of serial communication?

Yes. The recursion has a sound base case and a sound recursion rule.

  • In the base case, a singleton array contains its scan sum, the sum of all the elements in the array, because the sum of one value is that value itself.

  • In the recursive case, given sub-arrays H0 and H1, each of whose elements contains its sub-array-final scan sum, every element in the combined array can be made to contain the combined-array-final scan sum, by adding together those two scan sums. The other side's scan sum can be obtained from any element on the other side, so correspondent communication (or mirror, or any other lockstep communication between two arrays) will do it.
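
For example, if H0 = [1, 3] and H1 = [5, 7], their separate scans are [1, 4] and [5, 12], with final scan sums 4 and 12. Adding H0's sum 4 to each element of H1's scan gives [9, 16], and 4 + 12 = 16 is the final scan sum of the combined array, so the combined scan is [1, 4, 9, 16].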

Now that the scan sum of the H0 array is available everywhere in H1, we can add it to each element of H1. This yields an inside-out scan addition sequence. Instead of adding left to right one by one all the way across, we do it a whole array at a time, adjusting the entire H1 results-so-far array by adding to each element a summary constant from the H0 array. H1 was already a self-contained scan sequence; adding the H0 constant to each element lifts H1 into position as the scan successor of H0. Does that make sense?
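
Here is a minimal sketch of this recursion in Python (the names and the serial ordering are mine; in Divacon both halves and all their positions proceed in parallel, and the shared sum arrives by corr communication rather than by indexing):

    # Inclusive prefix sums by divide and conquer, without broadcast.
    # Each position carries a pair (a, b): a is the scan value at that
    # position, b is the sum of the whole current subarray, held
    # redundantly at every position.
    def scan_wo_broadcast(xs):
        def pdc(pairs):
            if len(pairs) == 1:              # base case: a singleton
                return pairs                 # already contains its scan sum
            mid = len(pairs) // 2
            h0, h1 = pdc(pairs[:mid]), pdc(pairs[mid:])
            b0, b1 = h0[0][1], h1[0][1]      # each half's scan sum (same at every position)
            lifted_h0 = [(a, b + b1) for (a, b) in h0]       # keep a; b becomes the combined sum
            lifted_h1 = [(a + b0, b + b0) for (a, b) in h1]  # lift H1 into place after H0
            return lifted_h0 + lifted_h1
        return [a for (a, _) in pdc([(x, x) for x in xs])]

    print(scan_wo_broadcast([0, 1, 2, 3, 4, 5, 6, 7]))
    # [0, 1, 3, 6, 10, 15, 21, 28]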

To make this fully concrete, here are the algorithm definitions and an example computation graph, with discussion following. For the elements not defined here, \(:\ \#\ !\ d\ c\ lr\ corr\ atom?\ f_b\), see the Tutorial.

Scan (without broadcast)
\(scan\ =\ !self : PDC(d_{lr},c_{lr},id,(!LO_0, !LO_1) : \#\ !\ corr, atom?, f_b)\)

where:

\(f_b\ a = a\ . a \)      Each gets a copy of itself at the bottom.

\(LO_{0}\ (a0.b0)\ .(a1.b1) = a0\ . b0+b1\)           Local Op within LHS (array 0): keeps its own scan value, but also locally calculates the scan sum of the doubled subarray.

\(LO_{1}\ (a0.b0)\ .(a1.b1) = a0+b1\ . b0+b1\)      Local Op within RHS (array 1): adds the LHS scan sum, \(b1\), but also makes the scan sum of the doubled array.

Sample input:        0   1   2   3   4   5   6   7      (preadjust = id, so keep dividing)

\(d_{lr}\) (3x):     0 | 1 | 2 | 3 | 4 | 5 | 6 | 7      (stopped at atom? = true)

\(f_b\):             0.0   1.1   2.2   3.3   4.4   5.5   6.6   7.7      (output data type is a pair, self.sum)

# ! corr (1st):      (0.0).(1.1)   (1.1).(0.0)   (2.2).(3.3)   (3.3).(2.2)   (4.4).(5.5)   (5.5).(4.4)   (6.6).(7.7)   (7.7).(6.6)      (each got corr's contents)

\((!LO_0,!LO_1)\):   0.0+1=0.1   1+0.1+0=1.1   2.2+3=2.5   3+2.3+2=5.5   4.4+5=4.9   5+4.5+4=9.9   6.6+7=6.13   7+6.7+6=13.13      (LHS: id; RHS: sum. Next: \(c_{lr}\))

\(c_{lr}\) (1st):    0.1   1.1   2.5   5.5   4.9   9.9   6.13   13.13      (next is # ! corr)

# ! corr (2nd):      (0.1).(2.5)   (1.1).(5.5)   (2.5).(0.1)   (5.5).(1.1)   (4.9).(6.13)   (9.9).(13.13)   (6.13).(4.9)   (13.13).(9.9)      (each sent self to corr)

\((!LO_0,!LO_1)\):   0.1+5=0.6   1.1+5=1.6   2+1.5+1=3.6   5+1.5+1=6.6   4.9+13=4.22   9.9+13=9.22   6+9.13+9=15.22   13+9.13+9=22.22      (ready for 2nd \(c_{lr}\))

\(c_{lr}\) (2nd):    0.6   1.6   3.6   6.6   4.22   9.22   15.22   22.22      (now it's 4-arrays)

# ! corr (3rd):      (0.6).(4.22)   (1.6).(9.22)   (3.6).(15.22)   (6.6).(22.22)   (4.22).(0.6)   (9.22).(1.6)   (15.22).(3.6)   (22.22).(6.6)      ((C)ommunicate, then (L)ocally calculate)

\((!LO_0,!LO_1)\):   0.6+22=0.28   1.6+22=1.28   3.6+22=3.28   6.6+22=6.28   4+6.22+6=10.28   9+6.22+6=15.28   15+6.22+6=21.28   22+6.22+6=28.28      ((S)tructure manipulate next: \(c_{lr}\))

\(c_{lr}\) (3rd):    0.28   1.28   3.28   6.28   10.28   15.28   21.28   28.28      (yields an 8-array: this is the PDC output. Last, drop the \(other\) half of each pair)

\(!self\):           0   1   3   6   10   15   21   28      (all done)
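
As a cross-check of the trace, here is a literal (serial) Python transcription of \(f_b\), \(LO_0\), \(LO_1\), and the corr exchange; the helper names and the level-by-level loop are mine, standing in for Divacon's combinators:

    def f_b(a):                        # base: each atom becomes self . self
        return (a, a)

    def LO0(mine, corr):               # LHS: a0 . b0+b1
        (a0, b0), (_, b1) = mine, corr
        return (a0, b0 + b1)

    def LO1(mine, corr):               # RHS: a0+b1 . b0+b1
        (a0, b0), (_, b1) = mine, corr
        return (a0 + b1, b0 + b1)

    def combine(h0, h1):               # # ! corr, then (!LO0, !LO1), then c_lr
        return ([LO0(s, c) for s, c in zip(h0, h1)] +    # H0[i] hears from H1[i]
                [LO1(s, c) for s, c in zip(h1, h0)])     # H1[i] hears from H0[i]

    level = [[f_b(x)] for x in range(8)]     # d_lr down to atoms, then f_b
    while len(level) > 1:                    # one pass per c_lr row of the trace
        level = [combine(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        print(level)
    print([a for a, _ in level[0]])          # ! self: [0, 1, 3, 6, 10, 15, 21, 28]

Each printed level matches the corresponding \(c_{lr}\) row of the trace above.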

In case you are tracking this against George's thesis (Mou 1990, p. 84), this presentation differs from his immaterially, and perhaps for personal reasons; see the Footnote below.

The invariant in this algorithm is that in every level's output-side array, that is, both after \(f_b\) and after \(c_{lr}\), in every resulting pair \(a\ .\ b\), the value \(b\) is always the scan sum of the whole array, even though it is calculated separately at each position. This miracle remains true because the scan sums of the left and right arrays before combination, which are redundantly added at each position, themselves sum to the scan sum of the combined array.
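
You can check this in the trace: before the third combination, H0's shared sum is 6 and H1's is 22, and after \((!LO_0,!LO_1)\) every pair in the combined 8-array carries \(b = 6 + 22 = 28\), the scan sum of the whole array.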

At each of three levels, the array sizes divide in half until base arrays have only one element.

I like to use the H0 H1 naming convention. In general, a division turns a full-length array, call it F, into two half-length arrays, H0 (on the left) and H1 (on the right), and combination turns two half-length arrays, H0 and H1, into a full-length array F. Whether after dividing or before combining, any communication that occurs, when it occurs, occurs between H0 and H1. This pattern holds at every level of division and every level of combination, no matter the size of the top-level input array or the depth of the computation graph.

The "last 0" communication pattern says: take the last element of H0 and send it to each and every element in H1. Since it is unidirectional there is no '!'; however, if you also wanted to send from the last of H1 to all of H0, you could perhaps write the communication pattern as "# ! last", which reads as "communicate bidirectionally from the last of each half to all members of the other half".

Although this "scan with broadcast" version is easy to write and easy to understand unrolled, there is a hidden cost: "last 0" is not a task that can be done independently in all positions at once, assuming a serial processor at each location. It has to send to the first recipient, then to the second, then to the third, etc., so the time cost is \(O(N)\), with \(N\) the size of the communicating sub-arrays, which halves at each level of division (in the case of pre-communication) and doubles at each level of combination (in the case of post-communication). Adding those all up using the applicable triangle-area formula, we get \(O(N \log N / 2)\), which reduces to \(O(N \log N)\) by the funny rules of order-of-magnitude reasoning. One would prefer \(O(\log N)\): can we do it? Yes, as the scan without broadcast above shows.
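
To spell out that sum (my arithmetic, matching the triangle picture): at the level where combined arrays have size \(2^k\), each combination broadcasts serially to \(2^{k-1}\) recipients, and there are \(N/2^k\) such combinations, so the total across all \(\log_2 N\) levels is

\(\sum_{k=1}^{\log_2 N} \frac{N}{2^k} \cdot 2^{k-1} = \sum_{k=1}^{\log_2 N} \frac{N}{2} = \frac{N \log_2 N}{2} = O(N \log N).\)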

Footnote

The computation graph differs from Mou 1990 p84 in four ways:
  • (1) I replaced the \(h_{scan}\) function call in the 4th argument of PDC() with its expansion, to reduce the number of uncomprehended layers I need to consider in my mind at once.

  • (2) I renamed his functions \(loc1 +\) and \(loc2 +\) as \(LO_0\) and \(LO_1\) because I couldn't tell whether \(loc\) was intended to refer to a \(location\) or a \(local\ operation\), both being consistent with the text and each leaving a question mark in the mind while reading. I discovered the latter is correct when George did this as a pencil exercise with me in person. In any case, this function naming is unambiguous.

  • (3) I renamed his variables \((x,sum), (x1,sum1), (x2,sum2)\) as \(a0.b0, a1.b1\) because \(sum\) is ambiguous between (1) the scan sum at the end of the array, (2) the scan sum for the whole array calculated redundantly at each point in the array, (3) the scan sum taken from the other array, and (4) the running sum that builds up from left to right as one does the scan operation in one's mind. Although these are related, the path through this thicket is clear if (2) is taken as correct. The \(a\ .\ b\) notation is less ambiguous to the learner, though also less evocative.

  • (4) I expanded the output display to include additional output lines after \(\# ! corr\) and after \((!LO_0,!LO_1)\). Although this produces a much busier computation diagram, it helped me to see the concrete obviousness of every calculation during the phase of getting the idea.

Is it personal? One must get comfortable with both the algorithm's calculations and the scope and meaning of the different functions. The mind scans back and forth trying out different, possible, joint interpretations, seeking the comforting redundancy of confirmation of trial answers, which indicates "I understand".

In the face of those four layers of nesting-complexity, uncertainty, possible misdirection, and information-hiding, I was unable to penetrate this without George's personal intervention with a watch-over-my-shoulder pencil exercise (for scan with broadcast), during which references became clear and the whole thing rather obvious.

My personal struggle with this sheds a bright and helpful light on the struggles of students learning literacy, especially those using letter names in English spelling (ABC's). Without concrete reliability of interpretation for the written elements of communication, whether letters in English or functions, variables, and operators in math, even a short sentence or equation rapidly becomes impenetrable to the multiple-ambiguity-limited human mind. What is the maximum number of placeholder question marks we can handle in our minds while trying out alternatives during the exploration of an expressed idea? I think one is already difficult; two or more are perhaps impossible without taking notes.

Copyright © 2025 Thomas C. Veatch. All rights reserved.
Created: May 5, 2025