Scan with Broadcast

A baby example


Scan applies a summary function like '+' to all the elements from the beginning of a vector up to the current element.
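As a quick, concrete aside (ordinary Python, nothing Divacon-specific): the standard library's itertools.accumulate computes exactly this running sum.

    from itertools import accumulate

    # Inclusive '+' scan: each output is the sum of all inputs
    # up to and including that position.
    print(list(accumulate([0, 1, 2, 3, 4, 5, 6, 7])))
    # [0, 1, 3, 6, 10, 15, 21, 28]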

Why such a simple problem, Tom? Addition is a second- or third-grade skill, while growing patterns, like the repeated addition in scan, might demand sixth-grade mathematical skills. Assuming we are all well past sixth grade, you might wonder why we are thinking about such a trivial operation. No, it's not because the math is hard.

It's because the numbers and function outputs flow around in the array while scan is being calculated, using the same scaffold that solves so many otherwise unimaginable parallel programming problems, and doing an easy example will give us tools to think through problems that involve such data flows.

I was simply unable to draw the flows here until I watched over George's shoulder as he drew it out: so simple, so obvious, yet completely beyond my powers without his hand-holding, even after many attempts and hours of struggle. (Ask to see my annotated copy of the relevant page of his thesis!) So let me save you the trouble I went through; I do hope it is simple and obvious to you after this.

Scan helps us to learn how to move data around within little parts and then bigger parts, to ultimately get big jobs done. If we can see this as easy, then we may also be able to understand more complex problems using the tools acquired here.

Now, scan can be visualized by stacking, in a column above each element, all the elements that enter into its scan sum:

n         =   0   1   2   3   4   5   6   7
                                          7
                                      6   6
                                  5   5   5
                              4   4   4   4
                          3   3   3   3   3
                      2   2   2   2   2   2
                  1   1   1   1   1   1   1
             +0  +0  +0  +0  +0  +0  +0  +0
scan(0:n) =   0   1   3   6  10  15  21  28

This view conveniently shows each number in each column copied identically all the way to the right, to the end. That is, not only does each element contain the sum-so-far of all the elements to the left, but also each block contains the sum-so-far (found at the end) of the block to the left.

For example, split the 8 elements into the first 4 and the last 4. The left-hand-side vector, 0, 1, 2, 3, can be scanned one-at-a-time to yield 0, 1, 3, 6, and the right-hand-side vector, 4, 5, 6, 7, can be scanned separately and then added on top of the first half's final sum, which repeats in the lower part of every RHS column. That is, the final column of the LHS sums 3+2+1+0, and every column of the RHS contains that same 3+2+1+0. This pattern holds not only at the largest scale but at half scale and quarter scale, at every scale down to single pairs of elements.

Now if we implement scan by copying the final sum of the LHS onto every element of the RHS, that copying is called a broadcast. Broadcast implies a global delay of \(O(N)\) time steps for the communication. (Scan without broadcast also exists; it finds a way to dissolve that delay into the \(O(\log N)\) steps of the post-adjust function.)
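If a few lines of ordinary Python help (a sketch of my own, not Mou's notation; it assumes the length is a power of two and '+' is the operator), the whole scan-with-broadcast idea fits in one recursion:

    def scan_with_broadcast(xs):
        # Inclusive +-scan: divide, recurse on each half, then broadcast
        # the last value of the scanned left half onto the right half.
        if len(xs) <= 1:                     # atom? -- base case is identity
            return list(xs)
        mid = len(xs) // 2
        h0 = scan_with_broadcast(xs[:mid])   # scan the left half
        h1 = scan_with_broadcast(xs[mid:])   # scan the right half
        carry = h0[-1]                       # last0: last element of H0 ...
        return h0 + [carry + x for x in h1]  # ... added to every element of H1

    print(scan_with_broadcast([0, 1, 2, 3, 4, 5, 6, 7]))
    # [0, 1, 3, 6, 10, 15, 21, 28]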

The computation trace below shows the basic communication and calculation pattern using the Divacon orthogonalization of Communication, Local operation, and Structure manipulation, all without the complexity of dissolving the broadcast.

These technical terms and PDC elements may be unfamiliar if you have not read the Tutorial. Let me mention the one I didn't understand for a while: "last0" means "communicate unidirectionally from the last element of the left (0'th) sub-array to every element of the right (1'st) sub-array."

Scan (incrementally sum) the input array
\(PDC(d_{lr},c_{lr},id,(id,+) : \# last0, atom?, id)\)

Input            0    1     2    3     4    5     6    7     preadjust = id, so keep dividing
\(d_{lr}\) (3x)  0 |  1  |  2 |  3  |  4 |  5  |  6 |  7     divided until atom? = true
# last0          0   1.0    2   3.2    4   5.4    6   7.6    last of each LHS sent to all of its RHS
(id,+)           0    1     2    5     4    9     6   13     LHS: id. RHS: +. Next: \(c_{lr}\)
\(c_{lr}\) (#1)  0    1  |  2    5  |  4    9  |  6   13     now four 2-arrays; next: (id,+) : # last0
# last0          0    1    2.1  5.1    4    9    6.9 13.9    last of each LHS sent to all of its RHS
(id,+)           0    1     3    6     4    9    15   22     LHS: id. RHS: +. Next: \(c_{lr}\)
\(c_{lr}\) (#2)  0    1     3    6  |  4    9    15   22     now two 4-arrays; next: (id,+) : # last0
# last0          0    1     3    6    4.6  9.6  15.6 22.6    last of LHS sent to all of RHS
(id,+)           0    1     3    6    10   15    21   28     LHS: id. RHS: +. Next: \(c_{lr}\)
\(c_{lr}\) (#3)  0    1     3    6    10   15    21   28     yields one 8-array: this is the output. All done.

(Here "3.2" means the element holding 3 has received a 2; the local operation (id,+) then adds the pair.)
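If it helps to see the whole scaffold as runnable code, here is a hedged sketch of a generic PDC combinator in ordinary Python (the names pdc, d_lr, c_lr, last0, and postadjust are my stand-ins for Mou's notation; preadjust = id is left implicit; and the broadcast is merely simulated, not parallel), instantiated exactly as in the formula above:

    def d_lr(a):                  # structure: divide into left/right halves
        m = len(a) // 2
        return a[:m], a[m:]

    def c_lr(h0, h1):             # structure: combine halves into one array
        return h0 + h1

    def last0(h0, h1):            # communication: last of H0 to all of H1
        return [(x, h0[-1]) for x in h1]

    def postadjust(h0, h1):       # (id,+) : # last0 -- LHS untouched, RHS adds
        return h0, [x + carry for (x, carry) in last0(h0, h1)]

    def pdc(divide, combine, post, atom, base, a):
        if atom(a):
            return base(a)
        h0, h1 = divide(a)
        r0 = pdc(divide, combine, post, atom, base, h0)
        r1 = pdc(divide, combine, post, atom, base, h1)
        return combine(*post(r0, r1))

    scan = lambda a: pdc(d_lr, c_lr, postadjust,
                         lambda a: len(a) == 1,   # atom?
                         lambda a: list(a),       # base function: id
                         a)
    print(scan([0, 1, 2, 3, 4, 5, 6, 7]))
    # [0, 1, 3, 6, 10, 15, 21, 28]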

The \(d_{lr}\) line of the trace shows that the array sizes have been divided in half three times, that is, until each base array holds only one element.

To dwell on this a moment: in general, the structure manipulation of division turns a full-length array, F, into two half-length arrays, H0 (on the left) and H1 (on the right), whereas the structure manipulation of combination turns two half-length arrays, H0 and H1, back into a full-length array, F. Whether after dividing or before combining, any communication that occurs, occurs between H0 and H1. This pattern holds at every level of division and every level of combination, no matter the size of the top-level input array or the depth of the computation graph.

The 'last0' communication pattern says: take the last element of H0 and send it to each and every element of H1. Since it is unidirectional there is no '!'; if you wanted also to send from the last of H1 to all of H0, you could perhaps write the communication pattern as "# ! last", which reads as "communicate bidirectionally from last to all".
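As a tiny sketch of the difference (again my simulation, not Divacon syntax; the bidirectional "# ! last" is the speculation above, not an attested operator):

    def comm_last(h0, h1, bidirectional=False):
        # Unidirectional (# last0): every element of H1 receives the last of H0.
        # Bidirectional ("# ! last"): every element of H0 also receives
        # the last of H1.
        received_by_h1 = [h0[-1]] * len(h1)
        received_by_h0 = [h1[-1]] * len(h0) if bidirectional else [None] * len(h0)
        return received_by_h0, received_by_h1

    print(comm_last([0, 1], [2, 3]))        # ([None, None], [1, 1])
    print(comm_last([0, 1], [2, 3], True))  # ([3, 3], [1, 1])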

Although this "scan with broadcast" version is easy to write and easy to understand unrolled, there is a hidden cost: "last0" is not a task that can be done independently at all positions at once. Assuming a serial processor at each location, the last element of H0 must send to the first element of H1, then to the second, then to the third, and so on, so each broadcast costs O(m) time steps, with m the size of the receiving sub-array; m halves at each level of division (for pre-communication) and doubles at each level of combination (for post-communication). Bounding the log(N) levels with the triangle area formula gives O(N*log(N)/2), which reduces to O(N*log(N)) by the funny rules of Order-of-Magnitude reasoning; in fact the level costs form a geometric series that already sums to O(N), as spelled out below. One would prefer O(log(N)): Can we do it? See Scan W/O Broadcast.
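To spell the sum out (under my assumption that broadcasts in different pairs proceed in parallel while the sends from a single sender serialize): the receiving half has size \(m = 1, 2, 4, \ldots, N/2\) as the combines proceed upward, so the total communication delay is

\[ \sum_{k=0}^{\log_2 N - 1} 2^k = N - 1 = O(N), \]

while the cruder bound of \(\log_2 N\) levels at up to \(N/2\) time steps apiece gives the triangle estimate \(O(N \log N)\). Either way it falls short of the \(O(\log N)\) we are after.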

Copyright © 2025 Thomas C. Veatch. All rights reserved.
Created: May 5, 2025