Algorithm 4.5: Scan without Broadcast

demonstrating broadcast dissolving, which reduces communication from O(n) to O(log(n))

On Mou
on Divide and Conquer

Introduction

I told George: I'm not that smart, but I'm somewhat determined, and so I'm getting to understand you a little. After some study I can report that I'm glad I had your personal conversation to guide me, because in the writing the lessons come in contemplating the result at the end. So please spend more time talking about the basic basics, the tautological foundations, and how they inform the strategic direction.

See how far you can push the idea of a string of bits, which powers the Black White Array and the Dimensional Shuffle Transform patents, as well as the (Communication-) Optimal Mapping paper. Or, similarly, Local-Co-SM (there's a mnemonic!): these are the three ultimate, basic, and profound elements of computation: (Local) local calculation, (Co) communication, and (SM) structure manipulation. Curiously, once specified, their mere juxtaposition falls into a sort of low-energy crystalline or even molecular structure which practically and optimally accomplishes high-level design intentions -- structures which seem like magic clockworks too complex to understand, except that everything was easy and clear to specify and works together naturally. It's as though you have mysteriously forced me to finally achieve Clear Thinking!

I remember as a high school student learning to think analytically. Being clear about what I am describing means taking it apart into its components and showing how they fit together. Then I really know the thing. This approach is quite similar, but in the domain of computation: the WAY I can be clear about what I am saying is by taking it apart into its components, which are these three, Local-Co-SM; and having expressed those parts clearly, the way to put them back together is so general it might solve every problem. Maybe it does.

Bits and Local-Co-SM are ideas that are, on the one hand, so obvious, and on the other so powerful, and yet again so underappreciated and underutilized, that they deserve emphasis. So let me repeat what you can find here and there about the basic constitutive elements of computation being (Local) local calculation, (Co) communication, and (SM) structure manipulation.

  • Each has nothing of the others in it.
  • All computation consists of these and nothing more than these.
  • "Local" local calculation is free of communication; everything needed is directly present.
  • "Co" communication is free of calculation or structure manipulation: just send the data itself from sender to recipient, no calculations in transit, no structure modification either.
  • "SM" structure manipulation in the sense of dividing and combining vectors doesn't move data, doesn't calculate anything, just asserts that you can look at this stick as being divided into two, or those two as being laid end to end.
  • We will look at a detailed example, \(scan \oplus\); hopefully you'll see the theme: these constituents are explicitly separated, simply defined, and put together clearly, enabling a powerful, efficient result.

    Let's begin.

    The Algorithm

    Algorithm 4.5 of George Mou's 1990 dissertation (henceforth "GM90") took me some work to get through. Here I'm going to go through it again, saying neither more nor less than the original, but giving more steps and detail so you may be able to follow it more easily than I did my first time through.

    The algorithm says, and I quote (replacing \(id\) with \(f_b\), and "," with "." in certain cases):

    $$\begin{align} scan \oplus &= PDC(d_{lr}, c_{lr}, id, (h_{scan} \oplus), atom?, f_b)\\ \text{where}\quad f_b\ x &= x.x \\ h_{scan} &= (!loc_1 \oplus, !loc_2 \oplus)\ :\ \#\ !\ corr \\ loc_1 ((x_1.s_1),(x_2.s_2)) &= (x_1 . (s_1 \oplus s_2)) \\ loc_2 ((x_1.s_1),(x_2.s_2)) &= ((x_1 \oplus s_2) . (s_1 \oplus s_2)) \end{align}$$
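
    For readers who find code easier to parse than the notation, here is a minimal Python transcription of the pieces above -- my own sketch, not GM90's code -- assuming \(\oplus\) is ordinary addition and writing the dotted pair \(x.s\) as a Python tuple (x, s):

        # f_b: lift a raw value x into the output domain x.s, where x is the
        # element's scan-so-far and s is the vector-final scan result.
        # For a singleton vector both are just x.
        def f_b(x):
            return (x, x)

        # loc_1: local calculation for the low (left) side. `own` is this
        # element's pair (x1, s1); `received` is the pair (x2, s2) obtained by
        # corr from the corresponding element of the other vector.
        def loc_1(own, received):
            (x1, s1), (x2, s2) = own, received
            return (x1, s1 + s2)          # keep x1, update the vector-final scan

        # loc_2: local calculation for the high (right) side: also fold the
        # other side's vector-final scan s2 into the local scan value x1.
        def loc_2(own, received):
            (x1, s1), (x2, s2) = own, received
            return (x1 + s2, s1 + s2)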

    Definitions

    This algorithm requires quite a few definitions to make sense of it.

    Someone might say, Look it up. Or, having read it and understood it, you should not need to look it up.

    George would say, if you still don't understand it, at least trust it as an authority that is speaking truth, and do the work needed so you do understand it. Of course it may take some study.

    I say: I can't remember where I read it, and I wasn't sure I understood it then, and certainly am not sure now, so I'll say it again here and try to put in a link to the original statement in case of ambiguity. Forgive my redundancy; this discussion repeats ideas found in different sources, mainly GM90.

    Discussion

    Now we are almost at the essence of the problem. Let me blather for a bit, and then maybe you can just read the original statement and it'll suddenly click. Okay?

    \(scan \oplus\) here is a postadjust algorithm, meaning that all the action is in the combining phase which takes pairs of vectors to combined vectors at each next level. Also, it means that the basic data in all the vectors will be the output data type. This is an explicit conversion done by \(f_b\) at the leaves of the division tree. But it also must remain so at each level of combining partial result vector pairs on the way up.

    In designing a parallel algorithm, this is the discipline: make sure that the properties you rely upon to do a calculation continue to hold at the next level. That's why preadjust and postadjust functions are needed, not just to do local calculations on communicated and local data, but also to ensure that the outcome for every member is of the same type as at the previous level. Following that discipline means we can then reason clearly and logically up and down the tree and feel confident in conclusions that the whole thing will indeed operate as intended and yield the result we want.

    The property we want to maintain is that each element of each vector carries that vector's final value, namely its scan result (2), along with its own local scan value (1) -- and that after combining, each element of the combined vector carries the combined vector's final value.

    How could this be done? It seems we can't do it at all -- without magic -- since the scan result is only discovered at the end, after doing a whole iterated scan of any vector. But no! There is a magic trick: every element in each combining vector already has that vector's end-of-vector scan result (2) with it, and by communicating with the corresponding element of the other vector, it can locally calculate the scan sum which will emerge at the end of the combined vector. The scan result at the end of two joined vectors is the sum of the independent scan results at the end of each! So we sum together the two vector-final scan sums, everywhere in both combining vectors. Magic!
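
    In symbols, and with the numbers from the worked trace further below (left half \(0\ 1\ 2\ 3\), right half \(4\ 5\ 6\ 7\)):

    $$ s_{combined} = s_{left} \oplus s_{right}, \qquad \text{e.g.}\quad 28 = 6 + 22 $$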

    Indeed, every element in the combined vector can calculate that same final sum value locally, because in the first combining vector every element (1) already has with it its vector's last value (2), and it receives from its corresponding element in the other half the last value of the second half (2). Symmetrically, in the second combining vector each element already has the value summarizing the second combining vector (2), and it receives from its corresponding element in the first combining vector its summary of the first (2). Add them, poof, you have the final scan result for the combined vector.

    Calculating these identical sums for each element of both combining vectors is a lot of redundant calculation, but the point is that it requires no further Co communication after the correspondent communication sends just one element across from each to each. (Conveniently, \(\#!corr\) can be done in parallel without delays or conflicts in transit.) In the result, sure enough, every element in a vector, as mentioned above regarding \(f_b\), is in the output domain, which is a double one may write as \((x.s)\): both (1) the element-local value \(x\), and (2) the vector-final scan result \(s\). Each element thus contains its own scan value (1), and also the vector-final scan value (2), which it locally calculated by making use of the scan value communicated from the other combining vector.
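
    As a concrete sketch of one combining step (again my own Python rendering, with \(\oplus = +\) and pairs as tuples), note that the correspondent communication hands each element exactly one pair from the other vector, after which every element finishes locally:

        # One combining step for two equal-length partial-result vectors whose
        # elements are pairs (x, s): x = scan-so-far, s = vector-final scan.
        def combine_step(left, right):
            # corr: element i of each half exchanges its pair with element i of
            # the other half -- one send and one receive per element, no broadcast.
            new_left  = [(x1, s1 + s2)      for (x1, s1), (x2, s2) in zip(left, right)]
            new_right = [(x1 + s2, s1 + s2) for (x1, s1), (x2, s2) in zip(right, left)]
            # c_lr: lay the two halves end to end.
            return new_left + new_right

        # Matches the trace below, e.g. combining [0.1, 1.1] with [2.5, 5.5]:
        print(combine_step([(0, 1), (1, 1)], [(2, 5), (5, 5)]))
        # -> [(0, 6), (1, 6), (3, 6), (6, 6)]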

    Does this explanation help? I had to draw it out in detail to see how it works, and to be comfortable that the data flows feedforward in a way that is stable, and the claims made at one level remain true at the next.

    Remember, all the members of the output type vector get a local calculation; in the case of (2) it yields the same result for each member of both combining vectors, and so all members of the vector contain the combined vector's scan result (2). They don't need to receive more communications about the scan result for the combined vector, because they had enough to calculate it themselves: the scan result for one combining vector, and the communicated scan result from the other combining vector. Summing those two yields the final scan result for the about-to-be-joined vector. Was that magic? Only until I saw it!

    One point of mystery to me was, how does this work at one and two levels up from \(f_b\)? Combining two singleton vectors \( [ x_1.s_1 ] \) and \( [ x_2.s_2 ] \), do we capture the whole scan by summing \(s_1 + s_2\)? Obviously we do because this is the normal case of one-by-one scanning.

    What about two 2-member vectors \( [ \lt x_1.s_1 \gt, \lt x_2.s_2 \gt ] \) and \( [ \lt x_3.s_3 \gt, \lt x_4.s_4 \gt ] \)? How is it that the final of the second vector becomes the \(s\) element everywhere? Well, first observe that already \(s_1 = s_2 = x_2\) and \(s_3 = s_4 = x_4\), because every vector member contains the vector's final value. Second, we can work it out with paper and pencil -- it's very simple to see -- that summing the finals \(s_2 + s_4\) sums everything. Imagine an element in the first vector were increased by some number \(a\). Wouldn't every element after it, including every element in the second vector, be increased by \(a\)? Yes. So the sum at the end of one vector influences the outcomes across the second vector. By how much? By that sum. We do capture the whole scan result of both vectors together by just summing \(s_2 + s_4\), because each final contains its whole vector's result so far. When we add \(s_2\) to all the elements in the second vector (which we do, since in one case we add \(s_1 = s_2\) to the first element of the second vector and in the other case we add \(s_2\) to the second element of the second vector) we distribute its effect across the whole second vector. And vice versa, from the second vector to the first, so now each element in both vectors has the final sum of the joined vector. Doing this repeatedly at every level maintains the proposition that each element contains its vector's final value even as we build up larger levels.

    Now we are really ready to define "Local", those local calculations, of which there are two types: \(loc_1\) and \(loc_2\). There are two sides of the communication, which is bidirectional. Both sides add their own temporary scan result to the other fellow's and store the sum (separately calculated, but identical in value) into their own scan result spot (2). One side is considered the low side, so it keeps its low value (1), all done. The other side, the high side, because it is scanning on top of the previous sums, will add the low side's scan result (2) into its local value (1); that is its new local scan value (1).

    That is the idea behind the \(loc_1\) and \(loc_2\) functions, which put identical sums into position (2), and on the low side just leave \(x_1\) there, but on the high side actually perform the next-level, incremental, full-vector-pair scan operation by adding the final-results-so-far from the other side to itself.

    So we put all this extra structure around the scan operation: the division and combination, the base predicate, and the utmost-simple base function into the output double domain. Then we postadjust the results by updating the partial vectors' final scan values locally at every element, reducing the actually-sequential scan operation to one bit inside \(h_{scan}\), namely "\(x_1 \oplus s_2\)" -- and even that is a recursive and subdivided summarizing operation, so that really everything is summaries moving around, accumulating half-vectors at a time into full vectors.
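
    Putting the pieces together, here is a minimal recursive sketch of the whole algorithm -- my own stand-in for the PDC combinator with \(d_{lr}\), \(c_{lr}\), \(atom?\), \(f_b\) and the postadjust \(h_{scan}\), again with \(\oplus = +\). A real Divacon implementation would of course run the two recursive calls, and all the local steps within a level, in parallel:

        def scan_plus(v):
            # atom?: a singleton vector is the base case; f_b x = x.x
            if len(v) == 1:
                return [(v[0], v[0])]
            # d_lr: divide into left and right halves (length a power of two)
            mid = len(v) // 2
            left, right = scan_plus(v[:mid]), scan_plus(v[mid:])
            # postadjust h_scan = (!loc_1, !loc_2) : # ! corr --
            # corr exchanges corresponding pairs, loc_1 / loc_2 finish locally
            new_left  = [(x1, s1 + s2)      for (x1, s1), (_, s2) in zip(left, right)]
            new_right = [(x2 + s1, s2 + s1) for (_, s1), (x2, s2) in zip(left, right)]
            # c_lr: combine the two halves end to end
            return new_left + new_right

        print(scan_plus([0, 1, 2, 3, 4, 5, 6, 7]))
        # -> [(0,28), (1,28), (3,28), (6,28), (10,28), (15,28), (21,28), (28,28)]
        # i.e. the last full vector of the trace below: 0.28, 1.28, 3.28, ..., 28.28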

    You can skip the following detailed, annotated Divacon computation graph if you feel you understand this. I really struggled to understand it, even though afterward it doesn't seem so complicated, so I did this to see each step. Let's follow GM90 Figure 4.7 and calculate \(scan + (0\ 1\ 2\ 3\ 4\ 5\ 6\ 7)\):

    [ 0 1 2 3 4 5 6 7 ] \(f = scan \oplus\) on this input first calls \(d_{lr}\) #1:
    [ 0 1 2 3 ] [ 4 5 6 7 ] Then \(!scan \oplus\) applies, thus 2x of
    [ 0 1 ] [ 2 3 ] [ 4 5 ] [ 6 7 ] \(d_{lr}\) #2, and \(! scan \oplus\) again, thus 2 more each of:
    [ 0 ] [ 1 ] [ 2 ] [ 3 ] [ 4 ] [ 5 ] [ 6 ] [ 7 ] \(d_{lr}\) #3. Now \(! f\) checks \(atom?\): true.
    [ 0.0 ] [ 1.1 ] [ 2.2 ] [ 3.3 ] [ 4.4 ] [ 5.5 ] [ 6.6 ] [ 7.7 ] Apply \(f_b\) to singleton vectors
    [ (0.0, 1.1) ] [ (1.1, 0.0) ] [ (2.2, 3.3) ] [ (3.3, 2.2) ] [ (4.4, 5.5) ] [ (5.5, 4.4) ] [ (6.6, 7.7) ] [ (7.7, 6.6) ] \(\#!corr\) takes adjacent pairs of vectors and communicates each to each
    \([ (loc_1 +)(0.0, 1.1) ] \)\([ (loc_2 +)(1.1, 0.0) ]\) \([ (loc_1 +)(2.2, 3.3) ] \)\([ (loc_2 +)(3.3, 2.2) ]\) \([ (loc_1 +)(4.4, 5.5) ] \)\([ (loc_2 +)(5.5, 4.4) ]\) \([ (loc_1 +)(6.6, 7.7) ] \)\([ (loc_2 +)(7.7, 6.6) ]\) distribute one \(loc +\) function per vector, i.e. per four numbers.
    \( [ 0.(0+1) ] \)\([ (1+0).(1+0)]\) \( [ 2.(2+3) ] \)\([ (3+2).(3+2)]\) \( [ 4.(4+5) ] \)\([ (5+4).(5+4)]\) \( [ 6.(6+7) ] \)\([ (7+6).(7+6)]\) Local calculations
    \([ 0.1 ]\)\([ 1.1 ]\) \([ 2.5 ]\)\([ 5.5 ]\) \([ 4.9 ]\)\([ 9.9 ]\) \([ 6.13 ]\)\([ 13.13 ]\) Local calculations
    \([ 0.1, 1.1 ]\)\([ 2.5, 5.5 ]\) \([ 4.9, 9.9 ]\)\([ 6.13, 13.13 ]\) \(c_{lr}\) #3. Level #3 is done.
    \([ (0.1, 2.5), (1.1, 5.5) ]\)\( [ (2.5, 0.1), (5.5, 1.1) ]\) \([ (4.9, 6.13), (9.9, 13.13) ]\)\( [ (6.13, 4.9), (13.13,9.9) ]\) \(post\) starts with \(\#!corr\) communication between array pairs
    \((!loc_1 +)[ (0.1, 2.5), (1.1, 5.5) ]\)\( (!loc_2 + )[ (2.5, 0.1), (5.5, 1.1) ]\) \((!loc_1 +)[ (4.9, 6.13), (9.9, 13.13) ]\)\( (!loc_2 +)[ (6.13, 4.9), (13.13,9.9) ]\) \((!loc_1+, !loc_2+)\) distributes each \(loc\) function over one of a pair of adjacent vectors.
    \([ (loc_1 +)(0.1, 2.5), (loc_1 +)(1.1, 5.5) ]\)\( [ (loc_2 +)(2.5, 0.1), (loc_2 +)(5.5, 1.1) ]\) \([ (loc_1 +)(4.9, 6.13), (loc_1 +)(9.9, 13.13) ]\)\( [ (loc_2 +)(6.13, 4.9), (loc_2 +)(13.13,9.9) ]\) here they are seen distributed.
    \([ 0.(1+5), 1.(1+5) ]\)\( [ (2+1).(5+1), (5+1).(5+1) ]\) \([ 4.(9+13), 9.(9+13) ]\)\( [ (6+9).(13+9), (13+9).(13+9) ]\) Apply the formulas
    \([ 0.6, 1.6 ]\)\( [ 3.6, 6.6 ]\) \([ 4.22, 9.22 ]\)\( [ 15.22, 22.22 ]\) Do the arithmetic
    \([ 0.6, 1.6, 3.6, 6.6 ]\)\( [ 4.22, 9.22, 15.22, 22.22 ]\) \(c_{lr}\) #2. Level #2 is done. One more \(post\) and \(c_{lr}\).
    \([ (0.6,4.22), (1.6,9.22), (3.6,15.22), (6.6,22.22) ]\)\( [ (4.22,0.6), (9.22, 1.6), (15.22, 3.6), (22.22, 6.6) ]\) \(\#!corr\) between 4-element arrays
    \((!loc_1+)[ (0.6,4.22), (1.6,9.22), (3.6,15.22), (6.6,22.22) ]\)\((!loc_2+) [ (4.22,0.6), (9.22, 1.6), (15.22, 3.6), (22.22, 6.6) ]\) Apply \(!loc_1+\) to the first, \(!loc_2+\) to the second
    \([ (loc_1+)(0.6,4.22), (loc_1+)(1.6,9.22), (loc_1+)(3.6,15.22), (loc_1+)(6.6,22.22) ]\)\( [ (loc_2+)(4.22,0.6), (loc_2+)(9.22, 1.6), (loc_2+)(15.22, 3.6), (loc_2+)(22.22, 6.6) ]\) Distributed
    \([ 0.28, 1.28, 3.28, 6.28 ]\)\( [ 10.28, 15.28, 21.28, 28.28 ]\) Carry out the local calculations
    \([ 0.28, 1.28, 3.28, 6.28, 10.28, 15.28, 21.28, 28.28 ]\) \(c_{lr}\) #1. Done!
    \(28\) Finally you might want to pull out the array of values (1), or only the last scan value; whatever you like.
    The output is a vector of dotted elements, not a single number; so that bit of bookkeeping seems to be a detail left outside of the statement of the algorithm.

    Did you notice how the function stack, built up by the time of \(f_b\), is a lot of promises carried out in the rest of the process? Good.

    Communication without Broadcast

    Did you notice that \(corr\) and \((f,f)\) apply to pairs of vectors? Division doesn't mean that calculations cannot receive communications from other vectors. It's not that you keep the vectors nearby, but that conceptually you know which vectors pair together and that you can communicate between them, to get their possibly remote data into the local vector. We let communication handle a lot here.

    In fact, in Divacon, communication is a first-class part of the language. In virtually no other high-level computer language is this the case. Given that computation is all and only Local-Co-SM -- local calculation, communication, and structure manipulation, the three orthogonal components of computation -- it is amazing that no one has surfaced Co, communication, leaving it always to compilers and interpreters, unexpressed in statements of algorithms. When communication is made explicit in the Divacon language, calculating its costs and optimizing its operation become possible. Not only is it difficult to think about what one never expresses, but when the compiler and chip designer have no hint about how to do communication within an algorithm, the most cumbersome, turn-taking, blocking, all-to-all, general-but-unbelievably-slow methods are used. Whereas we have seen here in Algorithm 4.5 communications which are:

      • correspondent: each element sends to and receives from exactly one partner element;
      • parallel: all the pairwise exchanges of a phase happen simultaneously, without delays or conflicts in transit;
      • few: only log(n) such phases are needed in all.

    Thus communications can be made very low in time cost.

    Considering that interprocessor communications can be 10x or 1000x the time cost of a local calculation, not to mention the resulting burdens and costs and complexities of multi-layered cache memory architectures, we really have to pay attention to computational intensity: the ratio between local calculations and communications. Computational intensity must be thought about and optimized, and the main solution to that is to have a communication model that can recursively decompose \(O(n)\) expensive (and, OMG, serial!) communications into \(O(log(n))\) cheap parallel communications, like \(\#!corr\).
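
    A back-of-envelope illustration (my own, counting phases only and ignoring constant factors): a strictly sequential scan needs n-1 dependent communications, while the \(\#!corr\) scheme needs one parallel exchange phase per level of the division tree:

        import math

        # communication phases: serial left-to-right scan vs. corr-based scan
        for n in (8, 1024, 1_000_000):
            serial_phases   = n - 1                    # one dependent send per element
            parallel_phases = math.ceil(math.log2(n))  # one corr exchange per level
            print(n, serial_phases, parallel_phases)
        # 8 -> 7 vs 3,  1024 -> 1023 vs 10,  1000000 -> 999999 vs 20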

    Broadcast Dissolving

    George points out that Algorithm 4.5 exemplifies "Broadcast Dissolving, which in fact can be applied to many other problems." If I may be more specific, Broadcast Dissolving means log(n) parallel correspondent communication phases, whole vector to whole vector, all element-pairwise in parallel, each followed by local calculations that yield an array-wide shared value separately calculated and stored with each member of each array. For example, notice (1+5), (1+5), (5+1), (5+1) just above \(c_{lr}\) #2 in the above example, each yielding 6. Thus every element arrives at a shared outcome without ever actually communicating 1-to-many. Broadcast, the slowest communication pattern, is replaced by correspondent communication, which can be done all in one step at each of log(n) levels, without interference or mutual blocking. Since communication time is much greater than local calculation time, and since the redundant local calculations are done in parallel -- simultaneously -- we vastly reduce algorithm time cost through Broadcast Dissolving.
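
    Here is Broadcast Dissolving in miniature -- my own illustration, not GM90's code, with + as the operator: every element ends up holding the same vector total without any one-to-many send, using only log(n) correspondent exchange phases and purely local additions:

        # Give every element the total of the vector without broadcasting it.
        def dissolved_total(values):
            totals = list(values)        # each element starts with just its own value
            n = len(totals)              # assumed to be a power of two
            span = 1
            while span < n:
                # one parallel phase: element i exchanges partial totals with its
                # correspondent i ^ span in the other half of its current block,
                # then adds the received partial total to its own -- locally.
                totals = [totals[i] + totals[i ^ span] for i in range(n)]
                span *= 2
            return totals                # every element now holds the same total

        print(dissolved_total([0, 1, 2, 3, 4, 5, 6, 7]))
        # -> [28, 28, 28, 28, 28, 28, 28, 28], with 3 phases instead of a broadcast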

    Thank you for your attention, and any comments or corrections are appreciated.

    Bibliography

    Zhijing George Mou, A Formal Model for Divide-and-Conquer and Its Parallel Realization, May 1990, Research Report YALEU/DCS/RR-795.

    Z. George Mou and Xiaojing Wang, Optimal Mappings of m Dimensional FFT Communication to k Dimensional Mesh for Arbitrary m and k. pp 104-119. in PARLE '93, Parallel Architectures and Languages Europe, Lecture Notes in Computer Science #694, Springer-Verlag. 5th International PARLE Conference, Munich, Germany, June 1993 Proceedings.  

     

    Copyright © 2024 Thomas C. Veatch. All rights reserved.
    Created: July 22, 2024