Parallel Monotonic Sort

pp. 89-90, Mou 1990

On Mou
on Divide and Conquer

Every solution is either trivial or false.
-- Zhijing George Mou.

Algorithm Statement

Here is Mou's Monotonic Sort algorithm, slightly reformatted:
sort = PDC(\(d_{lr}, c_{lr}\), id,                   !nest : (!min,!max) : #!mirr, atom?, id)
nest = PDC(\(d_{lr}, c_{lr}\), (!min,!max) : #!corr, id,                           atom?, id)
Inside the postmorphism PDC, sort, sits a premorphism PDC, nest. How does this miracle work?
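
Before going further, it may help to have something executable in hand. Here is a minimal Python sketch of the two PDCs as I read them; all function names are mine, not Mou's, and I unfold the higher-order PDC into plain recursion, assuming vector lengths are powers of two:

    # A sketch, not Mou's code: monotonic sort unfolded into plain recursion.
    def d_lr(v):
        """Left-right divide: split v into two related subvectors."""
        m = len(v) // 2
        return v[:m], v[m:]

    def c_lr(left, right):
        """Left-right combine: rejoin two related subvectors."""
        return left + right

    def mirr_exchange(left, right):
        """(!min,!max) : #!mirr -- left[i] meets right[n-1-i]; min stays left, max right."""
        n = len(left)
        return ([min(left[i],  right[n - 1 - i]) for i in range(n)],
                [max(right[i], left[n - 1 - i]) for i in range(n)])

    def corr_exchange(left, right):
        """(!min,!max) : #!corr -- left[i] meets right[i]; min stays left, max right."""
        return ([min(l, r) for l, r in zip(left, right)],
                [max(l, r) for l, r in zip(left, right)])

    def nest(v):
        """Premorphism: corr-exchange on the way down (pre), combine on the way up."""
        if len(v) == 1:                                   # atom? -> id
            return v
        left, right = corr_exchange(*d_lr(v))             # preadjust
        return c_lr(nest(left), nest(right))

    def mono_sort(v):
        """Postmorphism: recurse first, then mirr-exchange and nest each half (post)."""
        if len(v) == 1:                                   # atom? -> id
            return v
        left, right = d_lr(v)
        left, right = mono_sort(left), mono_sort(right)   # recursive center
        left, right = mirr_exchange(left, right)          # (!min,!max) : #!mirr
        return c_lr(nest(left), nest(right))              # !nest, then c_lr

    print(mono_sort([7, 6, 5, 4, 3, 2, 1, 0]))   # [0, 1, 2, 3, 4, 5, 6, 7]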

Look, see, get some intuition

It's fun and informative to watch the data flow as follows in an 8-item sort, where \(|V|\) is the working size of a data vector. Values are shown at the beginning, at the end, and whenever they change; otherwise they are omitted.

Input values:    7   6   5   4   3   2   1   0
Address in V:  000 001 010 011 100 101 110 111

Outcomes                   Function  Level   Event(s)
[7 6 5 4 3 2 1 0]          sort      |V|=8   begin
[7][6][5][4][3][2][1][0]   sort      |V|=1   \(d_{lr}\) 3x
[6][7][4][5][2][3][0][1]   sort      |V|=1   (!min,!max) : #!mirr (post)
[6 7][4 5][2 3][0 1]       sort      |V|=2   \(c_{lr}\)
[5 4][7 6][1 0][3 2]       sort      |V|=2   (!min,!max) : #!mirr (post)
[4][5][6][7][0][1][2][3]   nest      |V|=1   (!min,!max) : #!corr : \(d_{lr}\) (pre)
[4 5][6 7][0 1][2 3]       nest      |V|=2   \(c_{lr}\), and done in nest
[4 5 6 7][0 1 2 3]         sort      |V|=4   \(c_{lr}\)
[3 2 1 0][7 6 5 4]         sort      |V|=4   (!min,!max) : #!mirr (post)
[1 0][3 2][5 4][7 6]       nest      |V|=2   (!min,!max) : #!corr : \(d_{lr}\) (pre)
[0][1][2][3][4][5][6][7]   nest      |V|=1   (!min,!max) : #!corr : \(d_{lr}\) (pre)
[0 1 2 3][4 5 6 7]         nest      |V|=4   \(c_{lr}\) 2x, nest done
[0 1 2 3 4 5 6 7]          sort      |V|=8   \(c_{lr}\) 1x, sort done

See how 7 travels: first it completes a local sort within the low address bit at the \(|V|=1\) mirror; then the next higher address bit is set right at the \(|V|=2\) mirror, but with 7 at the bottom of its order, just so that nest can clean up the low bit again at the \(|V|=1\) corr.

In this way monotonic sort settles the low-bit sorting task, then pushes higher-half values into the higher half, but in reverse order: those travellers arrive reverse-sorted in the higher half, and then nest's corr moves each one by changing its address in its largest bit first (since corr is a preadjustment), and so on down to the lowest bit within the higher half.

You have the idea of how it flows. Now let's see why it cannot fail.

Technical Explanation and Proof of Correctness

Let's define some terms. First let the input vector be of length \(|V|=2^D\) for some nonnegative integer \(D\). For instance if there are \(1024=2^{10}\) values, then \(D=10\).

Second, let there be a divide operation, independent of the vector's values (hence "polymorphic"), that takes a vector of length \(|V|\) to two subvectors of length \(|V|/2\). I find it helpful to refer to these as related subvectors; in divacon, a combine operation will operate on two related subvectors to make a double-length vector.
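
In Python, such a divide and its inverse combine might be sketched as follows (repeating the sketch from above for convenience; note that the code never inspects the values, only the length, hence polymorphic):

    def d_lr(v):
        """Divide a vector of length |V| into two related subvectors of length |V|/2."""
        m = len(v) // 2
        return v[:m], v[m:]

    def c_lr(left, right):
        """Combine two related subvectors into one double-length vector."""
        return left + right

    v = [7, 6, 5, 4, 3, 2, 1, 0]
    assert c_lr(*d_lr(v)) == v    # combine inverts divide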

Then the address range of \(V\) comprises \(D\) bits, and its ordinal addresses are the numbers \(0\) to \(2^{D}-1\). In binary encoding an address is a string of bits \(B = (b_{D-1}, b_{D-2}, \ldots, b_2, b_1, b_0)\) from most to least significant (leading zeroes retained explicitly).

Such an address range may have its bits numbered in more than one way, for instance from \(0\) to \(D-1\) interpreted as least significant to most significant, that is, low to high. High to low is another such numbering, and indeed any mapping between \(0..D-1\) and the \(D\) bits, represented as a permutation of the bits \(B\), can potentially be used to organize the same operations, as George Mou showed in his Optimal Mappings paper.

The \(D\) bits of an address may be divided on the \(d\)'th bit in the numbering system, \(b_d \in \{0,1\}\). Suppose \(d+e=D\) and we are counting bits from least to most significant. Knowing \(d\) tells us where we are in a divide and conquer computation graph, in which, generally speaking, two subvectors of the same power-of-2 size are being divided from, or combined into, one double-size vector, and this is happening across some number of subvector pairs. The depth \(d\) specifies:

  • by the \(D-d-1\) more-significant bits, which pair of subvectors is under consideration at the moment;
  • by the \(d\) less-significant bits, the address range within either subvector;
  • by the \(1\) dividing bit, \(b_d\), which subvector of the pair is being considered.

To illustrate, suppose the 16 values of an input vector are, at a certain point in a parallel divide and conquer algorithm, divided into pairs of 4-vectors (2 such pairs will be seen). Then \(D=4\) (since \(2^4=16\)). The dividing bit is \(b_2\) (so \(d=2\)), leaving on one side the bits \(0..d-1\) for subvector-internal addresses, in this case 00 01 10 11. On the other, more-significant side of the \(d\)'th bit are bits \(d+1..D-1\), which select a pair of subvectors; in this case \(d+1=D-1=3\), so \(b_3=0\) selects the first pair of 4-vectors and \(b_3=1\) selects the last pair.

Restating: given a \(D\)-bit address range and a dividing bit \(d<D\), each subvector of a pair contains values whose addresses (positions within the subvector) are given by the \(d\) less-significant bits, that is, the range \(0..2^d-1\).

The dividing bit \(b_d\) with value 0 selects the first (left) subvector of a pair, and with value 1 the second (right).
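
To make this bookkeeping concrete, here is a small helper of my own (purely illustrative, not from Mou) that splits an address into the three fields, assuming bits are numbered from least to most significant:

    def decompose(a, D, d):
        """Split a D-bit address a at dividing bit d."""
        offset = a & ((1 << d) - 1)    # d less-significant bits: position within the subvector
        side   = (a >> d) & 1          # the dividing bit b_d: 0 = left, 1 = right
        pair   = a >> (d + 1)          # D-d-1 more-significant bits: which subvector pair
        return pair, side, offset

    # 16 values (D=4) divided into pairs of 4-vectors (d=2):
    print(decompose(0b1011, 4, 2))     # (1, 0, 3): second pair, its left 4-vector, offset 3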

Now we apply the divacon structure to our problem.

First, we divide the input vector \(V\) recursively in half until everything is an atomic vector \(|V|=1\). We may say that every atomic vector is fully sorted within itself. At this point, \(d=0\), and each subvector has no internal address space, and \(b_0\) specifies the left or right (atomic) subvector in the recursive algorithm.

Now, we set up our conditions of recursion: we are given a pair of internally presorted \(2^d\)-length vectors.

Apply the divacon expression, \((!min, !max) : \#!mirr\).

  • \(mirr\) mirrors addresses. Given subvectors of length \(2^d\), any address \(a\) is taken to its opposite address by \(!a=2^d-1-a\).
  • \(!mirr\) distributes the \(mirr\) operation over both subvectors, so that each element in either subvector communicates with its communication partner in the other subvector.
  • \(\#!mirr\) says copy, bring, and adjoin that communication partner's data to the local value.
  • \((!min, !max)\) says: if \(v\) is in the left subvector, keep the minimum value and discard the other; if \(v\) is in the right subvector, keep the maximum value and discard the other.

Hence \((!min,!max) : \#!mirr\) exchanges the values between communication partners if the right one was smaller than the left one, and leaves them unchanged if the left is equal to or smaller than the right.
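
As a sketch in Python (the name mirr_exchange is mine), this exchange on one pair of related subvectors is:

    def mirr_exchange(left, right):
        n = len(left)
        return ([min(left[i],  right[n - 1 - i]) for i in range(n)],   # minima stay left
                [max(right[i], left[n - 1 - i]) for i in range(n)])    # maxima stay right

    print(mirr_exchange([0, 1], [2, 3]))              # ([0, 1], [2, 3]): no swaps needed
    print(mirr_exchange([4, 5, 6, 7], [0, 1, 2, 3]))  # ([3, 2, 1, 0], [7, 6, 5, 4])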

Another bit of terminology. Given a finite collection of values of even size \(|V|\), there must exist a median value \(med\) (not necessarily unique) such that half the values are greater than or equal to \(med\) and the rest are less than or equal to \(med\).

Now there are two lemmas that need to be proved to verify this recursion. First, we join two presorted subvectors L and R in a way which sorts all the below-median values into L and all the above-median values into R. If everything is placed into the correct half, then in a sense we have sorted everything we have access to (the two subvectors being worked on together) according to its highest bit, the dividing bit. There may be chaos within L or within R, but from the dividing bit's perspective, everything in the two subvectors is in the correct subvector.

Second, we comb L and R separately in such a way that each becomes fully sorted internally.

The proposition of the monosort algorithm is that \((!min,!max):\#!mirr\) accomplishes the first, and that \(!nest\) accomplishes the second. If so, then applying \((!min,!max):\#!mirr\) will subsort at the dividing bit, inducing some chaos in the new L and R, while \(!nest\) will subsort from there so that all the less-significant bits are in order. Once subsorted between L and R and then within L and within R, we can conjoin L and R to make a fully-sorted, double-length vector. This satisfies the recursion requirements for the next level \(d+1\), and hence we can repeat for \(d=0\) to \(d=D-1\) and we are done when the highest-level pair of subvectors are between- and within-sorted, then joined, yielding the final result.

So let's prove these lemmas.

First, we are given equal-length subvectors L and R which are presorted. Then the count of values in L and R together is an even number, so \(med\) exists.

Lemma One: \((!min, !max) : \#!mirr\) places all the values below \(med\) into the left subvector and all values above \(med\) into the right subvector.

Proof: by \(mirr\) we associate the first (smallest) value in L with the last (largest) in R, and we step value by value in opposite directions: upward in L (since it is sorted increasingly) and downward in R (since it is also sorted increasingly, but we are stepping backwards from its end). While the compared left value remains smaller than the compared right value (which may be true for zero or more pairs), \((!min,!max):\#!mirr\) does not swap them. Continuing up L and down R, at some point (possibly at the very start) the sequences may cross, so that the compared left value exceeds the compared right value (again true for zero or more pairs; these are all the logical possibilities). Set \(med\) to any value between the closest values on the two sides of the crossing point. Continuing past the crossing point, the continually increasing values in L are always greater than their \(mirr\) communication partners in R, which keep getting smaller as we continue, so \((!min,!max):\#!mirr\) always swaps them: since L and R were originally presorted, beyond the crossing both sides only get farther onto the wrong side of \(med\), triggering the swap by an ever greater difference. Therefore, before the crossing point, all the values in L below \(med\) stay in L and those in R above \(med\) stay in R; after the crossing point, conversely, all the values in L above \(med\) move to R and those in R below \(med\) move to L. Hence after the \(mirr\) swap applies all the way across L and R, the above-median half ends in R and the below-median half ends in L. QED.
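
Lemma One is also easy to check by brute force. Here is a small test harness of my own (not part of Mou's presentation) that exercises the claim on random presorted subvectors:

    import random

    def mirr_exchange(left, right):
        n = len(left)
        return ([min(left[i],  right[n - 1 - i]) for i in range(n)],
                [max(right[i], left[n - 1 - i]) for i in range(n)])

    for trial in range(10000):
        n = random.choice([1, 2, 4, 8])
        L = sorted(random.randint(0, 9) for _ in range(n))
        R = sorted(random.randint(0, 9) for _ in range(n))
        newL, newR = mirr_exchange(L, R)
        assert max(newL) <= min(newR)                # everything splits around a median
        assert sorted(newL + newR) == sorted(L + R)  # a permutation: nothing lost
    print("Lemma One held on all trials")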

Lemma Two: \(!nest\) sorts the \(mirr\)-swapped outputs within L and within R respectively, such that each is fully internally sorted.

Proof Suggestions: This is a bit more work, and depends on the previous median being the limit of rising or falling sequences of values within the vector. Such limits at the turning points of rise-fall or fall-rise value sequences enable us to infer a little about the results of \((!min,!max):\#!corr\).

I persuade myself of it by drawing rising and falling lines, representing sequences of rising or falling values in L and R, between one pair of tall brackets for the inputs and another pair for the outputs; placing crossed axes on both (a conceptual half-count vertical line and a conceptual median horizontal line at \(med\)); and labelling segments of the two inputs so that I can track what happens to each fragment over on the output side.

Recall that \(!\) means distribution, so \(!nest\) means separately apply \(nest\) to the left and to the right subvectors. Hence \(!nest\) does not sort between L and R, only within each.
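
In code, distribution is just application to both members of a pair; a one-line sketch (names mine):

    def distribute(f):
        """!f : apply f separately to each subvector of a related pair."""
        return lambda pair: (f(pair[0]), f(pair[1]))

    print(distribute(sorted)(([3, 1], [2, 0])))   # ([1, 3], [0, 2]): each side sorted alone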

\(nest\) is another Parallel Divide and Conquer algorithm called within \(sort\):

nest = PDC(\(d_{lr}, c_{lr}\), (!min,!max) : #!corr, id, atom?, id)

Is this familiar yet? It says: split the input into left and right half-vectors, immediately communicate between corresponding elements of the two subvectors, swapping if the left value is higher than the right. Then split them again, and do the same correspondent communication and lower/higher swapping within each half-subvector.

The result of \((!min,!max):\#!corr\) is to cut the fall-rise or rise-fall pattern inherited from the previous stage, whether from \((!min,!max):\#!mirr\) or from a previous \((!min,!max):\#!corr\), into at most a rise-fall-rise or a fall-rise-fall, depending on the new median value and where it cuts the previous pattern. But the next round divides the subvectors in half yet again, simplifying the possible up-and-down patterning by half; and the median value \(med\) inherits from one level of \(\#!corr\) to the next, maintaining a certain relationship between extremes from the previous round, which is enough to guarantee that at each subdivision all the high values in the subvector pair go right and all the low values go left. This having been shown for any more-significant bit, \(nest\) recurses to the next less-significant bit, and the constraints are inherited again. In this way I show that the full pre-adjustment PDC \(nest\), combing L and R from the high to the low bits of their own address spaces all the way to the bottom, sorts on each of those bits, ending with a fully sorted L, and likewise R.
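
Short of a full proof, Lemma Two can also be checked by brute force. Here is a harness of my own (not Mou's) which feeds nest the outputs of the mirr exchange and verifies that each half comes out fully sorted:

    import random

    def mirr_exchange(left, right):
        n = len(left)
        return ([min(left[i],  right[n - 1 - i]) for i in range(n)],
                [max(right[i], left[n - 1 - i]) for i in range(n)])

    def corr_exchange(left, right):
        return ([min(l, r) for l, r in zip(left, right)],
                [max(l, r) for l, r in zip(left, right)])

    def nest(v):
        if len(v) == 1:
            return v
        m = len(v) // 2
        left, right = corr_exchange(v[:m], v[m:])   # preadjust, then recurse
        return nest(left) + nest(right)

    for trial in range(10000):
        n = random.choice([1, 2, 4, 8])
        L = sorted(random.randint(0, 9) for _ in range(n))
        R = sorted(random.randint(0, 9) for _ in range(n))
        newL, newR = mirr_exchange(L, R)
        assert nest(newL) == sorted(newL) and nest(newR) == sorted(newR)
    print("Lemma Two held on all trials")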

Hence we step \(d\) from the smallest, atomic subvectors (\(d=0\)) to the largest, half-\(V\) subvectors (\(d=D-1\)), doing \(mirr\) swaps to get things sorted with respect to bit \(b_d\), and then combing, and more finely recombing, with \(corr\) swaps starting between the largest subvectors, which with perfect tidiness pushes the median-structured value sequences into the correct left and right sides of increasingly divided subvectors, all the way down to atoms again.

Note that the identical \(mirr\) swap operations would occur in a PDC with a preadjustment function using \(d_{eo}, c_{eo}\) (so it operates on atoms in pairs, to start), or in a PDC with a postadjustment function using \(d_{lr}, c_{lr}\) (so it operates only once atoms in pairs are available, at the end of all the recursive divisions). George Mou has pointed out that

PDC(\(d_{lr}, c_{lr}\), id, op, \(pred_b\), \(f_b\)) = PDC(\(d_{eo}, c_{eo}\), op, id, \(pred_b\), \(f_b\))

or to put it in words, postadjust on left-right division/combination is the same as preadjust on even-odd division/combination. In both cases, the adjust operation op works first on pairs of related vectors of length 1, and works last on pairs of length \(|V|/2\). In the LR case, the atomic vectors appear after completely dividing the \(2^D\) values \(D\) times, and all the related subvector pairs are processed after the recursive center is completed (postadjust), while in the EO case the atomic vectors appear immediately upon the first even-odd division, and the adjust process applies before the recursive center is entered (preadjust). But the same operations occur on the same data, moving from small to large, atomic to maximal, in either approach. Hence if

\(monosort_{eo} = PDC(d_{eo}, c_{eo}, !nest : (!min,!max) : \#!mirr, id, atom?, id)\)

Then it follows that

\(monosort_{lr} = monosort_{eo}\)

George reports that after presenting monotonic sort to his Yale peers and professors, Alan Perlis pulled him aside to ask him whether another version of a second-order PDC could be found with an alternative nesting relationship, such as \(mirr\) inside \(corr\), etc.

Given the Optimal Mappings paper, it seems clear to me that any permutation of bits should be usable in a monotonic sort algorithm. The dividing bit \(d\) can range, instead of from 0 to \(D-1\), from the 0'th to the \(D-1\)'th permuted bit, so long as permuted divide and permuted combine functions are also available.

Just as in hypercube communication, it seems natural that one should be able to divide on any bit -- any dimension -- and recombine later on that bit in the order specified by the given permutation of bits. If we consider the unsorted values according to their final-sorted ordinal address values, and consider any bit the lowest and any bit the highest according to some fixed permutation \(p\), then \(d_{p}, c_{p}\) will navigate in the permuted order, subdividing the binary hypercube corresponding to all the addresses in the input data vector. This requires that we consider values or addresses as something more akin to linguistic elements than geometrical ones, and that, instead of making smaller and larger groupings, which depends on the non-permutation of the address bits, we abandon our intuitions about what belongs together and let the algorithm follow correct reasoning.

I'll bet a dollar that the values will shuffle and percolate across surprising distances but in the end will pop into final order no matter the permutation. To that end the terminology above was intended to enable this generalization.

Feedback is welcome.
Copyright © 2024 Thomas C. Veatch. All rights reserved.
Created: December 1, 2024