Scan with broadcast incurs \(O(N)\) communication delays at each
step, to carry out the broadcast to \(N\) recipients. These delays
might be dissolved into the \(O(\log N)\) steps of a Divacon program if
we could locally calculate, within each side of a divided array pair
H0 and H1, the value of "last0", which would otherwise require
\(O(N)\)-time communication. Can we send to all of H0 the last of H0,
and to all of H1 the last of H1, by parallel calculations instead of
serial communications?
Yes. The recursion has a sound base case and a sound recursion rule.
- In the base case, a singleton array contains its scan sum, the
sum of all the elements in the array, because the sum of one
value is that value itself.
- In the recursive case, given sub-arrays H0 and H1, each of whose
elements contains its sub-array-final scan sum, every element in
the combined array can be made to contain the combined-array-final
scan sum, by adding together those two scan sums. The other side's
scan sum can be obtained from any element on the other side, so
correspondent communication (or mirror, or any other lockstep
communication between two arrays) will do it.
Now that the scan sum of the H0 array is available everywhere in H1,
we can add it to each element of H1. This yields an inside-out scan
addition sequence. Instead of adding left to right, one by one, all
the way across, we can do it a whole array at a time, by adjusting the
entire H1 results-so-far array: adding to each element a summary
constant from the H0 array. H1 was already a self-contained scan
sequence; adding the H0 constant to each element lifts H1 into
position as the scan successor of H0. Does that make sense?
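To make that recursion concrete before the Divacon notation, here is a
minimal sketch in plain Python (my own illustration; the function name
and shape are mine, not Divacon's). Each element carries a pair
(prefix, total), where total is the scan sum of the sub-array the
element currently belongs to; the corr exchange is simulated by
reading the other half's total directly.

```python
def scan(xs):
    """Inclusive prefix sums by divide-and-conquer, without broadcast.

    A sequential mimicry: each element carries (prefix, total), where
    total is the scan sum of its current sub-array.
    """
    def go(sub):
        if len(sub) == 1:                      # base case: a singleton
            return [(sub[0], sub[0])]          # contains its own scan sum
        mid = len(sub) // 2
        h0, h1 = go(sub[:mid]), go(sub[mid:])  # each half already scanned
        t0, t1 = h0[-1][1], h1[-1][1]          # each half's scan sum
        # Left keeps its prefixes; right is lifted by H0's scan sum t0.
        # Every position records the combined scan sum t0 + t1.
        return ([(a, t0 + t1) for a, _ in h0] +
                [(a + t0, t0 + t1) for a, _ in h1])
    return [a for a, _ in go(xs)]              # finally, drop the totals

assert scan([0, 1, 2, 3, 4, 5, 6, 7]) == [0, 1, 3, 6, 10, 15, 21, 28]
```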
To make this fully concrete, here are the algorithm definitions and an
example computation graph, with discussion following. For the
elements not defined here, \(:\ \#\ !\ d\ c\ lr\ corr\ atom?\ f_b\),
see the Tutorial.
Scan (without broadcast)

\(scan\ =\ !self : PDC(d_{lr},c_{lr},id,(!LO_0, !LO_1) : \#\ !\ corr, atom?, f_b)\)

where:

\(f_b\ a = a\ .\ a\)  Each element gets a copy of itself at the bottom.

\(LO_{0}\ (a0.b0)\ .\ (a1.b1) = a0\ .\ b0+b1\)  The Local Op within the LHS (array 0) just copies its own scan value, but also locally calculates the scan sum for the doubled sub-array.

\(LO_{1}\ (a0.b0)\ .\ (a1.b1) = a0+b1\ .\ b0+b1\)  The Local Op within the RHS (array 1) adds the LHS scan sum, \(b1\), but also makes the scan sum for the doubled sub-array.
| Step | p0 | p1 | p2 | p3 | p4 | p5 | p6 | p7 | Notes |
|------|----|----|----|----|----|----|----|----|-------|
| Sample Input | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | preadjust=id so keep dividing |
| \(d_{lr}\) 3x | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | stopped at atom?=true |
| \(f_b\) | 0.0 | 1.1 | 2.2 | 3.3 | 4.4 | 5.5 | 6.6 | 7.7 | output data type is a pair, self.sum |
| # ! corr (1st) | (0.0).(1.1) | (1.1).(0.0) | (2.2).(3.3) | (3.3).(2.2) | (4.4).(5.5) | (5.5).(4.4) | (6.6).(7.7) | (7.7).(6.6) | each got corr's contents |
| \((!LO_0,!LO_1)\) | 0.0+1 = 0.1 | 1+0.1+0 = 1.1 | 2.2+3 = 2.5 | 3+2.3+2 = 5.5 | 4.4+5 = 4.9 | 5+4.5+4 = 9.9 | 6.6+7 = 6.13 | 7+6.7+6 = 13.13 | LHS: id. RHS: sum. Next: \(c_{lr}\) |
| \(c_{lr}\) (1st) | 0.1 | 1.1 | 2.5 | 5.5 | 4.9 | 9.9 | 6.13 | 13.13 | next is # ! corr |
| # ! corr (2nd) | (0.1).(2.5) | (1.1).(5.5) | (2.5).(0.1) | (5.5).(1.1) | (4.9).(6.13) | (9.9).(13.13) | (6.13).(4.9) | (13.13).(9.9) | each sent self to corr |
| \((!LO_0,!LO_1)\) | 0.1+5 = 0.6 | 1.1+5 = 1.6 | 2+1.5+1 = 3.6 | 5+1.5+1 = 6.6 | 4.9+13 = 4.22 | 9.9+13 = 9.22 | 6+9.13+9 = 15.22 | 13+9.13+9 = 22.22 | ready for 2nd \(c_{lr}\) |
| \(c_{lr}\) (2nd) | 0.6 | 1.6 | 3.6 | 6.6 | 4.22 | 9.22 | 15.22 | 22.22 | now it's 4-arrays |
| # ! corr (3rd) | (0.6).(4.22) | (1.6).(9.22) | (3.6).(15.22) | (6.6).(22.22) | (4.22).(0.6) | (9.22).(1.6) | (15.22).(3.6) | (22.22).(6.6) | (C)ommunicate, (L)ocally calculate |
| \((!LO_0,!LO_1)\) | 0.6+22 = 0.28 | 1.6+22 = 1.28 | 3.6+22 = 3.28 | 6.6+22 = 6.28 | 4+6.22+6 = 10.28 | 9+6.22+6 = 15.28 | 15+6.22+6 = 21.28 | 22+6.22+6 = 28.28 | (S)tructure manipulate (\(c_{lr}\)) |
| \(c_{lr}\) (3rd) | 0.28 | 1.28 | 3.28 | 6.28 | 10.28 | 15.28 | 21.28 | 28.28 | yields an 8-array; this is the PDC output. Last, drop \(other\) |
| \(!self\) | 0 | 1 | 3 | 6 | 10 | 15 | 21 | 28 | All done |
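If it helps to replay the unrolled graph mechanically, here is a
level-by-level mimicry in plain Python (again my own sketch, not
Divacon syntax; `scan_levels` is a hypothetical name). The corr
exchange is simulated by index arithmetic, and the assert checks the
invariant discussed next.

```python
def scan_levels(xs):
    """Replay the computation graph level by level (len(xs) a power of 2)."""
    n = len(xs)
    pairs = [(x, x) for x in xs]              # f_b: each holds self.self
    s = 1                                     # current size of H0 and H1
    while s < n:
        nxt = list(pairs)
        for base in range(0, n, 2 * s):       # each combined pair F = H0.H1
            for i in range(s):
                a0, b0 = pairs[base + i]      # element of H0
                a1, b1 = pairs[base + s + i]  # its corr partner in H1
                nxt[base + i] = (a0, b0 + b1)           # LO_0
                nxt[base + s + i] = (a1 + b0, b0 + b1)  # LO_1
            # Invariant: after combining, every b in F is F's scan sum,
            # computed redundantly at every position.
            assert all(b == sum(xs[base:base + 2 * s])
                       for _, b in nxt[base:base + 2 * s])
        pairs = nxt
        s *= 2
    return [a for a, _ in pairs]              # !self drops the sums

print(scan_levels([0, 1, 2, 3, 4, 5, 6, 7]))
# [0, 1, 3, 6, 10, 15, 21, 28], matching the table's last row
```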
In case you are tracking with George's thesis (Mou 1990, p. 84), this
presentation differs from his immaterially, and perhaps for personal
reasons. See the Footnote below.
The invariant in this algorithm is that in every level's output-side
array, that is, both after \(f_b\) and after \(c_{lr}\), in every resulting
pair \(a\ . b\), the value \(b\) is always the scan sum of the whole
sub-array it belongs to, even though it is separately calculated at
each position. This miracle remains true because the scan sums of the
left and right arrays before combination, which are redundantly added
at each position, themselves sum to the scan sum of the combined array.
At each of three levels, the array sizes divide in half until base
arrays have only one element.
I like to use the H0 H1 naming convention. In general a division
turns a full-length array, call it F, into two half-length arrays, H0
(on the left) and H1 (on the right), and combination turns two
half-length arrays, H0 (on the left) and H1 (on the right), into a
full-length array, F. Whether after dividing or before combining, any
communication that occurs, when it occurs, occurs between H0 and H1.
This pattern holds at every level of division and every level of
combination, no matter the size of the top-level input array or the
depth of the computation graph.
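In plain Python terms (hypothetical helpers of mine for flat lists,
not Divacon's own \(d_{lr}\) and \(c_{lr}\)), the convention reads:

```python
def d_lr(f):
    """Divide a full-length array F into halves H0 (left) and H1 (right)."""
    mid = len(f) // 2
    return f[:mid], f[mid:]

def c_lr(h0, h1):
    """Combine halves H0 and H1 back into a full-length array F."""
    return h0 + h1

h0, h1 = d_lr([0, 1, 2, 3, 4, 5, 6, 7])
assert (h0, h1) == ([0, 1, 2, 3], [4, 5, 6, 7])
assert c_lr(h0, h1) == [0, 1, 2, 3, 4, 5, 6, 7]
```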
The 'last0' communication pattern says: take the last element of H0
and send it to each and every element of H1. Since it is unidirectional
there is no '!'; however, if you wanted to also send from the last of
H1 to all of H0, you could perhaps write the communication pattern as
"# ! last", which reads as "communicate bidirectionally from the last of
each half to all members of the other half".
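The serial cost hiding in 'last0' is visible in a sketch (plain
Python, hypothetical name): one sender, len(H1) sequential sends.

```python
def last0(h0, h1):
    """Send H0's last element to every element of H1, one send at a time."""
    sender = h0[-1]
    # A serial processor at the sending location must loop over every
    # recipient: O(len(h1)) communication steps, not O(1).
    return [(x, sender) for x in h1]

print(last0([0, 1, 3, 6], [4, 9, 15, 22]))
# [(4, 6), (9, 6), (15, 6), (22, 6)]
```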
Although this "scan with broadcast" version is easy to write and easy
to understand unrolled, there is a hidden cost here in that "last 0"
is not a task that can be done independently in all the positions at
once, assuming a serial processor in each location. No, it has to
send to the first, then to the second, then to the third, etc. so the
total time cost is O(N) with N the size of the communicating
sub-arrays, which halves at each level of division (in the case of
pre-communication) and doubles at each level of combination (in the
case of post-communication), so adding those all up using the
applicable triangle area formula we get O(N*log(N)/2) which further
reduces to O(N*log(N)) by the funny rules of Order-of-Magnitude
reasoning. One would prefer it to be O(log(N)): Can we do it? Yes.
See Scan W/O Broadcast.
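One way to make the triangle-area bookkeeping concrete (my own reading
of the argument, counting one serialized message per H1 element at
each of the \(\log_2 N\) combination levels):

\[
\underbrace{\frac{N}{2} + \frac{N}{2} + \cdots + \frac{N}{2}}_{\log_2 N\ \text{levels}}
= \frac{N \log_2 N}{2} = O(N \log N).
\]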
Footnote
The computation graph differs from Mou 1990 p84 in four ways:
- (1) I replaced the \(h_{scan}\) function call in the 4th argument of
PDC() with its expansion, to reduce the number of
uncomprehended layers I need to consider in my mind at once.
- (2) I renamed his functions \(loc1 +\) and \(loc2 +\) as \(LO_0\) and
\(LO_1\) because I couldn't tell whether \(loc\) was intended to
refer to a \(location\) or a \(local\ operation\), both being
consistent with the text and each leaving a question mark in the
mind when reading. I discovered the latter is correct
when George did this as a pencil exercise with me in person.
In any case, this function naming is unambiguous.
- (3) I renamed his variables \((x,sum), (x1,sum1), (x2,sum2)\) as
\(a0.b0, a1.b1\) because \(sum\) is ambiguous between (1) the
scan sum at the end of the array, (2) the scan sum for the whole
array calculated redundantly at each point in the array, (3) the
scan sum taken from the other array, and (4) the running sum that
builds up from left to right as one does the scan operation in
one's mind. Although these are related, the path through this
thicket is clear if (2) is taken as correct. The \(a\ . b\) notation
is less ambiguous to the learner, though also less evocative.
- (4) I expanded the output display to include additional output lines
after \(\# ! corr\) and after \((!LO_0,!LO_1)\). While producing
a much busier computation diagram, this helped me to see the
concrete obviousness of every calculation, during the phase
of getting the idea.
Is it personal? One must get comfortable with both the algorithm's
calculations and the scope and meaning of the different functions.
The mind scans back and forth trying out different possible joint
interpretations, seeking the comforting redundancy of confirmation of
trial answers, which indicates "I understand".
In the face of those four layers of nesting-complexity, uncertainty,
possible misdirection, and information-hiding, I was unable to
penetrate this without George's personal intervention, a
watch-over-my-shoulder pencil exercise (for scan with broadcast),
during which the references became clear and the whole thing became
rather obvious.
My personal struggle with this sheds a bright and helpful light on the
struggles of students learning literacy, especially those taught
using letter names in English spelling (the ABCs). Without
concrete reliability of interpretation for the written elements of
communication, whether letters in English or functions, variables, and
operators in math, even a short sentence or equation rapidly becomes
impenetrable to the multiple-ambiguity-limited human mind. What is
the maximum number of placeholder question marks we can handle in our
minds while trying out alternatives during the exploration of an
expressed idea? I think one is already difficult, two or more are
perhaps impossible without taking notes.