Parallel Monotonic Sort

p89-90 Mou 1990

On Mou
on Divide and Conquer

Every solution is either trivial or false.
-- Zhijing George Mou.

Mou's Statement

Mou's thesis, "4.3.5 Monotonic Sort - A Second Order PDC" pp89-90, drops the following statement while expanding on the possibilities of his Parallel-Divide-and-Conquer functions (nesting them, in particular):
PDC's can not only be composed together as shown in Section 4.3.3, but can also be nested into other PDC's to compute more complicated problems. A higher order PDC is a PDC that uses [an]other PDC in its constituent(s). The number of nesting levels is called the order of the PDC. In this section we will show a second order sorting PDC algorithm.

Different from the well-known bitonic sort algorithm, the following PDC program always arranges entries in a monotonic ascending order. It should be observed that the program is a second-order pseudomorphism, where a premorphism is nested inside a postmorphism.

Algorithm 4.12 A Second Order PDC Sorting Algorithm.

sort = PDC(dlr, clr, id, loc : #!mirr, atom?, id)

where

loc = !nest : (!min,!max)

nest = PDC(dlr, clr, (!min,!max) : #!corr, id, atom?, id)

This algorithm is illustrated in Figure 4.9 in which computation corresponding to the top level is indicated by the circles and computation corresponding to the second level is indicated by the solid dots.

I find this complete, but not elaborated, nor obvious, nor proven, although implementation will show that it in fact works.

Algorithm Statement

Allow me to elaborate, attempting to make this more understandable, even to prove it.

It will help to first recall the divacon computation scaffold.

Pre and Post Adjustment

Here sort is a post-adjustment operation, so we divide to atoms then combine recursively doing the work on recombination. On the other hand nest is a pre-adjustment operation, we divide but only once, then do the work on the now-related half-vectors before dividing again. Work is done on division; when finally we reach atoms on division all the work is done, except for concatenating them all bottom up, in this case magically yielding a completely sorted output.

Post Adjust sort

So, having divided to atoms before doing anything (sort being a postadjust PDC), we then begin the work.

We have pairs of singleton vectors, each of which is already internally sorted (non-decreasing, for sure, how else can you sort one item? -- it is sorted with itself without doing anything). So the prerequisites are met.

Given, then, a pair of vectors each internally sorted, we combine with (!min,!max): #!mirr, which guarantees that output left and right vectors are separated by, or only overlap at, their median: the strictly lower half of the data always ends in the left vector and higher in the right.

However the left and right outputs are not fully sorted. Indeed the left vector rises then falls; the right vector falls then rises; median values if they occur form the peak of the left vector and the trough of the right vector. Each could be considered a sawtooth whether peak-to-peak (hence falling-rising) or valley-to-valley (hence rising-falling).

A Sawtooth or Bitonic Sequence

I want to capture and generalize this idea of a single sawtooth as a sequence: a sequence of values, starting anywhere on the rise or fall of it, and continuing to exactly one sawtooth-width away. Here we go.

Let a sequence be strictly NDNI, or loosely 'rising-falling', if it is a concatenation of two zero-or-more-length subsequences, the first being Non-Decreasing and the second being Non-Increasing.

Then the empty sequence, a singleton sequence, an increasing or non-decreasing sequence, and a decreasing or non-increasing sequence are all strictly NDNI, because any of the lengths could be zero.

(A pair is NDNI because its first element alone is an ND subsequence and its second element alone is an NI subsequence.)

Let a sequence be 'bitonic' if it is a rotation of an NDNI sequence.

Then:

  • A falling-rising sequence is bitonic.

  • A rising-falling-rising sequence is bitonic exactly when its first element is greater than or equal to its last.
  • A falling-rising-falling sequence is bitonic exactly when its first element is less than or equal to its last.

  • A strictly rising-falling-rising-falling sequence is NOT bitonic, and
  • A strictly falling-rising-falling-rising sequence is NOT bitonic.
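These claims can be checked mechanically. Here is a minimal Python sketch (the predicate names is_ndni and is_bitonic are mine, not the document's notation), testing bitonicity by trying every rotation against the NDNI shape:

```python
def is_ndni(s):
    """Strictly NDNI: a non-decreasing run then a non-increasing run,
    either of which may be empty."""
    i, n = 0, len(s)
    while i + 1 < n and s[i] <= s[i + 1]:   # consume the non-decreasing part
        i += 1
    while i + 1 < n and s[i] >= s[i + 1]:   # consume the non-increasing part
        i += 1
    return i + 1 >= n                        # did we consume the whole sequence?

def is_bitonic(s):
    """Bitonic: some rotation of s is NDNI."""
    return any(is_ndni(s[c:] + s[:c]) for c in range(max(len(s), 1)))

assert is_ndni([]) and is_ndni([5]) and is_ndni([1, 2, 2, 1])
assert is_bitonic([3, 1, 0, 2])      # falling-rising
assert is_bitonic([2, 3, 1, 0, 1])   # rising-falling-rising, first >= last
assert not is_bitonic([1, 2, 1, 2])  # rising-falling-rising with first < last
```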

Fun so far. It will come in handy later.
Hence a strictly NDNI sequence has the form

V = ( e_0, e_1, ... e_{N-1} ) where e_0 <= e_1 <= ... <= e_x >= e_{x+1} >= ... >= e_{N-1}

for some peak index x.

Now let us say that a vector of length 1, a 1D vector, is both sorted and bitonic, because no ordering constraints from the definitions are violated.

Next, let's have some language for vector ordering.

Given two vectors V_1, V_2, let's say V_1 and V_2 are in vector order, written V_1 <= V_2, if every element e_i of V_1 and every element e_j of V_2 satisfy e_i <= e_j.

Now we want to prove two results.

1) (!min,!max) : #!mirr will take two sorted vectors to two bitonic vectors in vector order.

In a way we trade away the internal ordering of the input sorted vectors in return for vector ordering of the output vectors which are no longer sorted, but still at least they are bitonic.

Then our job is to get the bitonic subvectors each back into sorted order, because the top-level sort between the vectors, V_L <= V_R, is already established.

2) (!min,!max) : #!corr will take a single bitonic vector and make a sorted vector.

Actually it won't by itself, but if you do this as a divacon preadjustment at all the levels of division then combination, then will it? Note that preadjustment with correspondent communication on d_{lr}, c_{lr} swaps paired elements between corresponding positions in the largest (half) vectors first, then within quarters, etc., down to singleton vectors.

That would be a nice trick. I'm still not sure it's true. Here is Mou's Monotonic Sort algorithm, slightly reformatted:

sort = PDC(\(d_{lr}, c_{lr}\), id, !nest : (!min,!max) : #!mirr,      atom?, id)
nest = PDC(\(d_{lr}, c_{lr}\),                 (!min,!max) : #!corr, id, atom?, id)

Inside the postmorphism PDC sort is a premorphism PDC, nest.
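The two PDCs transcribe directly into Python. This is a sketch, not Mou's notation: the names monosort, nest, mirr_minmax, and corr_minmax are mine; lists model vectors, and concatenation models \(c_{lr}\).

```python
def mirr_minmax(L, R):
    """(!min,!max) : #!mirr -- pair L[i] with R[n-1-i]; keep min left, max right."""
    n = len(L)
    L2, R2 = L[:], R[:]
    for i in range(n):
        a, b = L[i], R[n - 1 - i]
        L2[i], R2[n - 1 - i] = min(a, b), max(a, b)
    return L2, R2

def corr_minmax(L, R):
    """(!min,!max) : #!corr -- pair L[i] with R[i]; keep min left, max right."""
    return ([min(a, b) for a, b in zip(L, R)],
            [max(a, b) for a, b in zip(L, R)])

def nest(V):
    """Premorphism PDC: corr compare-swap across the halves on division,
    then recurse into each half; combine by concatenation (c_lr)."""
    if len(V) <= 1:                        # atom? -> id
        return V
    h = len(V) // 2                        # d_lr
    L, R = corr_minmax(V[:h], V[h:])       # preadjust
    return nest(L) + nest(R)               # c_lr

def monosort(V):
    """Postmorphism PDC: divide to atoms, then on each recombination apply
    !nest : (!min,!max) : #!mirr to the pair of half-vectors."""
    if len(V) <= 1:                        # atom? -> id
        return V
    h = len(V) // 2                        # d_lr
    L, R = monosort(V[:h]), monosort(V[h:])
    L, R = mirr_minmax(L, R)               # postadjust: (!min,!max) : #!mirr
    return nest(L) + nest(R)               # !nest distributed over both, then c_lr

assert monosort([7, 6, 5, 4, 3, 2, 1, 0]) == [0, 1, 2, 3, 4, 5, 6, 7]
```

Running monosort on any power-of-two-length list returns it fully sorted.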

How does this miracle work?

Look, see, get some intuition

It's fun and informative to watch the data flow as follows in an 8-item sort, where |V| is the working size of a data vector. Values are shown at beginning, at end, and upon a change; otherwise not.

Input data:     7 6 5 4 3 2 1 0
Address in V: 000 001 010 011 100 101 110 111

Outcome                     Function  |V|  Event(s)
[7 6 5 4 3 2 1 0]           sort      8    begin
[7][6][5][4][3][2][1][0]    sort      1    d_lr 3x
[6][7][4][5][2][3][0][1]    sort      1    (!min,!max) : !mirr (post)
[6 7][4 5][2 3][0 1]        sort      2    c_lr
[5 4][7 6][1 0][3 2]        sort      2    (!min,!max) : !mirr (post)
[4][5][6][7][0][1][2][3]    nest      1    (!min,!max) : !corr : d_lr (pre)
[4 5][6 7][0 1][2 3]        nest      2    c_lr, and done in nest
[4 5 6 7][0 1 2 3]          sort      4    c_lr
[3 2 1 0][7 6 5 4]          sort      4    (!min,!max) : !mirr (post)
[1 0][3 2][5 4][7 6]        nest      2    (!min,!max) : !corr : d_lr (pre)
[0][1][2][3][4][5][6][7]    nest      1    (!min,!max) : !corr : d_lr
[0 1 2 3][4 5 6 7]          nest      4    c_lr 2x, nest done.
[0 1 2 3 4 5 6 7]           sort      8    c_lr 1x. sort done.

See how 7 travels: first it completes a local sort within the address low bit at the \(|V| = 1\) mirror; then the next higher address bit is made right at the \(|V| = 2\) mirror, but with 7 at the bottom of its order, just so that nest can clear up the low bit again at the \(|V|=1\) corr.

In this way monotonic sort clears the low bit sort task, then pushes higher-half values into the higher half but in reverse order: those travellers are reverse sorted in the higher half, then nest corr moves them by an address changed in its largest bit: in the higher half first (since it's a preadjust), down to the lowest bit in the higher half.

You have the idea of how it flows. Now let's see why it cannot fail.

Technical Explanation and Proof of Correctness

Claude's proof is better than mine.

Proof of Monotonic Sort (Mou, 1990)

Algorithm Statement

  sort = PDC((dlr, clr), id, !nest : (!min,!max) : #!mirr,     atom?, id)
  nest = PDC((dlr, clr),             (!min,!max) : #!corr, id, atom?, id)
  
Input: vector V of length \(N = 2^D\).

sort is a postadjustment PDC: divide V into atoms, then on recombination at each level d (d=0 to D-1), apply !nest : (!min,!max) : #!mirr to each pair of related subvectors of size \(n = 2^d\) before combining them.

Proof Structure

Inductive claim for sort: After processing through level d, every subvector of size \(2^{d+1}\) is sorted.

Base case (d=0): Atoms are trivially sorted. The first postadjustment combines pairs of atoms. \((!min,!max):\#!mirr\) on singletons is just a compare-swap. !nest on atoms is identity. Result: [min, max]. Sorted.

Inductive step: Given sorted \(L = [l_0 \le \ldots \le l_{n-1}]\) and sorted \(R = [r_0 \le \ldots \le r_{n-1}]\), show that (!min,!max) : #!mirr places all values below median in L' and all above median in R' (Sorted Split Lemma, Lemma 1).

We find trivially that L' and R' are sawtooth sequences (bitonic).

Then nesting another PDC (preadjust, compare-swap between corresponding elements) recursively divides each into half-vectors. The compare-swap across halves of a single sawtooth yields two half-vector sawtooths L and R with max(L) \(\le\) min(R). (Lemma 2, Bitonic Split Lemma).

Given the Bitonic Split for a \(2^d\)-length vector, the low and high halves of the data are forced into the left and right \(2^{d-1}\)-length vectors. Recursively, the whole sequence ends fully sorted.

Since the top level emitted from the mirr step was already sorted by halves into L and R, and now each half is recursively sorted by halves down to atoms, L++R is fully sorted.

And since the outer PDC applying mirror compare-swap proceeds from atoms to the full data vector, the last level yields a full sort on the entire input.

That's the idea. Now to believe me you need to hear the Sorted Split Lemma and the Bitonic Split Lemma.


Lemma 1, Sorted Split: (!min,!max):#!mirr separates by median

Statement: Given sorted L and R each of size n, after applying \((!min,!max):\#!mirr\), the result L', R' satisfies max(L') ≤ min(R').

Operation: Mirror communication pairs L[i] with R[n-1-i].

  • L'[i] = min(L[i], R[n-1-i])
  • R'[n-1-i] = max(L[i], R[n-1-i])

Proof: As i increases from 0 to n-1, L[i] increases (L sorted) and R[n-1-i] decreases (R sorted, index decreasing). So the difference R[n-1-i] - L[i] is monotonically non-increasing.

First the clear case; then, the exception. If no median value occurs in either L or R, then, counting from opposite ends, at some point the values cross each other. There exists a crossing index k (0 ≤ k ≤ n) such that:

  • For all i < k: L[i] < R[n-1-i] → no swap
  • For all i ≥ k: L[i] > R[n-1-i] → swap

After the operation:

  • L' = { L[0], ..., L[k-1], R[n-1-k], ..., R[0] }
  • R' = { L[n-1], ..., L[k], R[n-k], ..., R[n-1] }

Observe that the pairwise comparisons and swaps leave the elements of L before the crossing index in L': before this median point they remain non-decreasing and stay in L. Then the elements from R, counting backwards from its end, after the crossing index are less than the median value and get swapped into L'.

But now the median value separates L' and R', and indeed max(L') < min(R').

In the output vectors, there are two sections which are increasing-decreasing in L', and decreasing-increasing in R'.

(Therefore Corollary: Both L' and R' are bitonic. We will use this below.)

The conclusion of a top-level sort between L and R holds: all values below the median are in L' and all above the median are in R'.

Now the exception. If median values occur in L or R, then whether they swap or not, and whether each ends up in L' or R', they form the extreme values there (max in L', min in R'), changing the conclusion only from the strict max(L') < min(R') to max(L') ≤ min(R').

QED Lemma 1.
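Lemma 1 and its corollary can also be checked exhaustively on small cases. A Python sketch (the helper names are mine; the rotation-based bitonicity test follows the definition given earlier):

```python
from itertools import product

def mirr_minmax(L, R):
    """(!min,!max) : #!mirr on a pair of equal-length vectors."""
    n = len(L)
    L2, R2 = list(L), list(R)
    for i in range(n):
        a, b = L[i], R[n - 1 - i]
        L2[i], R2[n - 1 - i] = min(a, b), max(a, b)
    return L2, R2

def is_ndni(s):
    """Non-decreasing run then non-increasing run (either may be empty)."""
    i, n = 0, len(s)
    while i + 1 < n and s[i] <= s[i + 1]:
        i += 1
    while i + 1 < n and s[i] >= s[i + 1]:
        i += 1
    return i + 1 >= n

def is_bitonic(s):
    return any(is_ndni(s[c:] + s[:c]) for c in range(len(s)))

# Every pair of sorted length-3 vectors over {0,1,2}, duplicates included.
for L in product(range(3), repeat=3):
    for R in product(range(3), repeat=3):
        if list(L) != sorted(L) or list(R) != sorted(R):
            continue
        L2, R2 = mirr_minmax(L, R)
        assert max(L2) <= min(R2)                  # Lemma 1: median split
        assert is_bitonic(L2) and is_bitonic(R2)   # Corollary: both bitonic
```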


Lemma 2: nest sorts a bitonic sequence

Statement: nest applied to a bitonic sequence of length 2n produces a sorted sequence.

nest = PDC(\(d_{lr}, c_{lr}\), (!min,!max):#!corr, id, atom?, id)

This is a preadjustment PDC. On input B of length 2n:

  1. Apply (!min,!max):#!corr: compare B[i] with B[i+n], put min at i, max at i+n.
  2. Divide: L = first half (mins), R = second half (maxes).
  3. Recurse: nest(L), nest(R).
  4. Combine: concatenate.

Proof by strong induction on n.

Base case (n=1): Input is [a, b], length 2. The corr step compares them and produces [min(a,b), max(a,b)]. Division yields atoms; recursion is identity. Result: sorted.

Inductive step: Assume nest sorts bitonic sequences of length < 2n. We need the Bitonic Split Lemma:

After (!min,!max):#!corr on a bitonic input of length 2n:
(a) max(L) ≤ min(R), and
(b) L and R are each bitonic of length n.

If this holds, then by induction nest(L) is sorted, nest(R) is sorted, and max(L) ≤ min(R), so L++R is sorted.

Proof of the Bitonic Split Lemma

We use the 0-1 Principle (Knuth, TAOCP 5.3.4): a comparator network sorts all inputs correctly if and only if it sorts all 0-1 inputs. It suffices to prove the lemma for 0-1 bitonic sequences.

A 0-1 bitonic sequence has at most two transitions (block boundaries) when viewed circularly. Every 0-1 bitonic sequence of length 2n has one of these forms:

  • All 0s or all 1s
  • 0a 1b 0c (a+b+c = 2n, b ≥ 1)
  • 1a 0b 1c (a+b+c = 2n, b ≥ 1)
  • 0a 1b (a+b = 2n) [subcase of the above with c=0 or a=0]

For 0-1 inputs, the corr operation computes:

  • L[i] = B[i] AND B[i+n]
  • R[i] = B[i] OR B[i+n]

Part (a): max(L) ≤ min(R) for 0-1 bitonic B

Equivalently: it cannot happen that L[i] = 1 and R[j] = 0 simultaneously.

L[i] = 1 requires B[i] = 1 AND B[i+n] = 1.
R[j] = 0 requires B[j] = 0 AND B[j+n] = 0.

Case 0a 1b 0c: The 1-positions are the interval [a, a+b).

L[i] = 1 requires both i ∈ [a, a+b) and i+n ∈ [a, a+b), so i ∈ [a, a+b-n). This is nonempty only when b > n.

R[j] = 0 requires both j ∉ [a, a+b) and j+n ∉ [a, a+b). The 0-positions total 2n - b < n (when b > n). Two positions j, j+n that are n apart cannot both lie in a set of fewer than n contiguous-or-two-block positions of total size < n. (If j and j+n are both outside [a, a+b): j < a and j+n ≥ a+b requires n ≥ b, contradicting b > n. Both in leading 0s requires j+n < a, impossible since n > a when b > n and a+b+c = 2n. Both in trailing 0s requires j ≥ a+b, then j+n ≥ a+b+n > 2n, out of range.)

When b ≤ n: L[i] = 1 is impossible (no i satisfies both i and i+n in [a, a+b) of width ≤ n). So L is all 0s.

Case 1a 0b 1c: Symmetric argument.

Therefore max(L) ≤ min(R) for all 0-1 bitonic sequences. By the 0-1 principle: max(L) ≤ min(R) for all bitonic sequences.

Part (b): L and R are bitonic

For 0-1 sequences, L[i] = B[i] AND B[i+n], R[i] = B[i] OR B[i+n].

The 1-positions of B, viewed as a subset of {0, ..., 2n-1}, form at most two intervals. Define:
  S1 = {i ∈ [0,n) : B[i] = 1}
  S2 = {i ∈ [0,n) : B[i+n] = 1}

Each of S1 and S2 is the intersection of the 1-positions of B with [0,n) or [n,2n) respectively (shifted). Since B's 1-positions form at most two intervals in [0,2n), each of S1 and S2 is a union of at most two intervals in [0,n).

L's 1-positions = S1 ∩ S2 = intersection of ≤2 intervals with ≤2 intervals = at most 2 intervals. (Intersection of two unions of intervals.)

R's 1-positions = S1 ∪ S2 = union of ≤2 intervals with ≤2 intervals = at most 4 intervals. But we need at most 2 for bitonicity.

The tighter bound on R follows from the constraint that S1 and S2 derive from a single bitonic sequence: the 1-positions of B form at most two contiguous arcs on the circle {0,...,2n-1}. Projecting onto [0,n) by collapsing i and i+n (via OR), the image has at most two intervals. (Proof: if the 1-positions are a single arc, the OR-projection onto [0,n) is at most two intervals—the arc may wrap once. If two arcs, they are complementary parts of the circle, and the projection again yields at most two intervals.)

Hence L and R are 0-1 sequences with at most two block boundaries: bitonic. By the 0-1 principle, for general sequences, L and R are bitonic.

This completes the Bitonic Split Lemma, hence Lemma 2, hence the proof.
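The 0-1 case of the Bitonic Split Lemma is small enough to enumerate outright: every 0-1 bitonic sequence is a rotation of a sorted 0-1 sequence, so we can generate them all and check both parts. A Python sketch (names mine):

```python
def corr_split(B):
    """(!min,!max) : #!corr: pair B[i] with B[i+n]; on 0-1 values, AND and OR."""
    n = len(B) // 2
    return ([min(B[i], B[i + n]) for i in range(n)],
            [max(B[i], B[i + n]) for i in range(n)])

def bitonic01(B):
    """A 0-1 sequence is bitonic iff it has at most two circular block boundaries."""
    return sum(B[i] != B[(i + 1) % len(B)] for i in range(len(B))) <= 2

# All 0-1 bitonic sequences of length 2n arise as rotations of 0^(2n-b) 1^b.
n = 4
for ones in range(2 * n + 1):
    base = [0] * (2 * n - ones) + [1] * ones
    for c in range(2 * n):
        B = base[c:] + base[:c]
        assert bitonic01(B)
        L, R = corr_split(B)
        assert max(L) <= min(R)               # part (a)
        assert bitonic01(L) and bitonic01(R)  # part (b)
```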


Summary

  1. sort divides to atoms (trivially sorted), then recombines bottom-up.
  2. At each level, given sorted L, R:
    • \((!min,!max):\#!mirr\) separates below-median → L', above-median → R' (Lemma 1). L' and R' are bitonic (Corollary).
    • !nest sorts each bitonic subvector via recursive corr-splitting (Lemma 2).
    • Combine: sorted L' ++ sorted R', with max(L') ≤ min(R') → sorted.
  3. nest sorts bitonic sequences by: corr-split producing two smaller bitonic sequences with left ≤ right (Bitonic Split Lemma), recurse.
  4. Induction closes at both levels. Monotonic sort is correct.

Complexity: \(O(\log^2 N)\) parallel time, \(O(N \log^2 N)\) total comparisons. ▮

Let's define some terms. First let the input vector be of length \(|V|=2^D\) for some nonnegative integer \(D\). For instance if there are \(1024=2^{10}\) values, then \(D=10\).

Second, let there be a divide operation independent of vector values (hence "polymorphic") that takes a vector of length |V| to two subvectors of length |V|/2. I find it helpful to refer to these as related subvectors; in divacon a combine operation will operate on two related subvectors to make a double-length vector.

Then the address range of \(V\) comprises \(D\) bits, and its ordinal addresses are numbers \(0\) to \(2^{D}-1\). In binary encoding an address is a set of bits \(B = \{ b_{D-1}, b_{D-2}, ... b_2, b_1, b_0 \}\) from most to least significant (leading zeroes retained explicitly).

Such an address range may have its bits permuted by any numbering system, such as from \(0\) to \(D-1\) interpreted as least significant to most significant, that is, low to high. High to low is another permutation; but note that any mapping between \(0..D-1\) and the various \(D\) bits, represented as a permutation of the bits \(B\), can potentially be used to organize the same operations, as George Mou showed in his Optimal Mappings paper.

The \(D\) bits of an address may be divided on the \(d\)'th bit in the numbering system, \(b_d \in \{0,1\}\); suppose we are counting bits from least to most significant. If we know \(d\), then that tells us where we are in a divide and conquer computation graph, in which, generally speaking, two subvectors of some (same, power of 2) size are being divided from or combined into one double-size one, and this is happening with some number of subvector pairs. The depth \(d\) specifies:

  • by the \(D-d-1\) more-significant bits, which pair of subvectors is being looked at at the moment;
  • by the \(d\) less-significant bits, the address range within either subvector,
  • by the \(1\) dividing bit, \(b_d\), we identify which subvector of the pair is being considered.

To illustrate with an example, suppose for example 16 values in an input vector are at a certain point in a parallel divide and conquer algorithm divided into pairs of 4-vectors (2 such pairs will be seen). Then \(D=4\) (since \(2^4=16\)). The dividing bit is \(b_2\) (so \(d=2\)) leaving on one side bits \(0..d-1\) for subvector-internal addresses, in this case 00 01 10 11. On the other, more-significant side of the \(d\)'th bit are bits \(d+1..D-1\) which are used to select a pair of subvectors, in this case \(d+1=D-1=3\) so \(b_3 = 0\) selects the first pair of 4-vectors, and if \(b_3 = 1\) that selects the last pair of 4 vectors.

Restating: if we are given a \(D\)-bit address range and a dividing bit \(d < D\), then each subvector of a pair of subvectors contains values whose addresses (positions within the subvector) are spanned by bits \(0..d-1\).

\(b_d\), the dividing bit, having value 0 selects the first (left) subvector in a pair, and value 1 selects the second (right) subvector in a pair.
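This address decomposition is easy to make concrete. A small Python sketch (the function name decompose is mine), reproducing the 16-value example with \(D=4\), \(d=2\):

```python
def decompose(a, d):
    """Split a D-bit address a at dividing bit d into
    (pair index, side bit, offset within subvector)."""
    offset = a & ((1 << d) - 1)   # the d less-significant bits
    side   = (a >> d) & 1         # b_d: 0 = left subvector, 1 = right
    pair   = a >> (d + 1)         # the D-d-1 more-significant bits
    return pair, side, offset

# 16 values divided into pairs of 4-vectors: D = 4, d = 2.
# Address 0b0110 (6): pair 0 (b_3 = 0), right 4-vector, offset 2.
assert decompose(0b0110, 2) == (0, 1, 2)
# Address 0b1011 (11): pair 1 (b_3 = 1), left 4-vector, offset 3.
assert decompose(0b1011, 2) == (1, 0, 3)
```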

Now we apply the divacon structure to our problem.

First, we divide the input vector \(V\) recursively in half until everything is an atomic vector \(|V|=1\). We may say that every atomic vector is fully sorted within itself. At this point, \(d=0\), and each subvector has no internal address space, and \(b_0\) specifies the left or right (atomic) subvector in the recursive algorithm.

Now, we set up our conditions of recursion: we are given a pair of internally presorted \(2^d\)-length vectors.

Apply the divacon expression, \((!min, !max) : \#!mirr\).

  • \(mirr\) mirrors addresses. Given subvectors of length \(2^d\), any address \(a\) is taken to its opposite address by \(!a=2^d-1-a\).
  • \(!mirr\) distributes the \(mirr\) operation over both subvectors, so that each element in both subvectors communicates with its communication partner in the other subvector.
  • \(\#!mirr\) says copy, bring, and adjoin that communication partner's data to the local value, from the mirroring or opposite end of the opposite vector.

  • (!min, !max) applied to a pair of vectors says if v is in the left subvector then keep the minimum value and discard the other, while if v is in the right subvector then keep the maximum value and discard the other.

Hence \((!min,!max) : \# ! mirr\) exchanges the values between communication partners if the right one was smaller than the left one, and leaves them unchanged if the left is equal or smaller than the right.

Another bit of terminology. Given a finite collection of values of even size \(|V|\), there must exist a median value \(med\) (not necessarily unique), such that half the values are above or equal to \(med\), and the rest of the values are below or equal to \(med\).

Now there are two lemmas that need to be proved to verify this recursion. First we join two presorted subvectors L and R in a way which sorts all the below-median values into L and all the above-median values into R. If everything is placed into the correct half, then in a sense we have sorted everything we have access to (the two subvectors being worked on together) according to their index's highest bit, the dividing bit. There may be chaos within L or within R, but from the dividing bit perspective, everything in the two subvectors is in the correct subvector, the lower values in the left and the higher values in the right.

Secondly, we internally comb L and R separately in such a way that they become internally fully sorted.

The proposition of the monosort algorithm is that \(post=(!min,!max):\#!mirr\) accomplishes the first, and that \(pre=!nest\) accomplishes the second. If so, then applying \((!min,!max):\#!mirr\) will subsort at the dividing bit, inducing some limited chaos in the new L and R, while \(!nest\) will subsort from there so that all the less-significant bits are in order. (The limited chaos is that each of L and R are not random but bitonic.)

Once sorted into lower and upper half-vectors L and R, and then subsorted within L and within R, we can conjoin L and R to make a fully-sorted, double-length vector. This satisfies the recursion requirements for the next level \(d+1\), and hence we can repeat for \(d=0\) to \(d=D-1\) and we are done when the highest-level pair of subvectors are between- and within-sorted, then joined, yielding the final result.

So let's prove these lemmas.

First, we are given equal-length subvectors L and R which are presorted. Then the count of values in L and R together is an even number, so the median \(med\) exists (defined as halfway between the two middlemost points).

Lemma One: \((!min, !max) : \#!mirr\) places all the values below med into the left subvector and all values above \(med\) into the right subvector.

Proof: by \(mirr\) we associate the first (smallest) in L with the last (largest) in R and step value by value in opposite directions, upward (since it is sorted increasingly) in L and downward in R (since it is also sorted increasingly but we are stepping backwards along R from the end). While the compared left value remains smaller than the right value (which might be true for zero or more values), then \((!min,!max):\#!mirr\) does not swap them. Continuing up L and down R, at some point (which might be at the very start) they may cross so that we find the compared left value exceeds the compared right value (which might be true for zero or more values -- these are all the logical possibilities); after the crossing point we set \(med\) to any value between the closest values on the two sides of the crossing point, such as their average. Counting onward we continue up L and down R and after the crossing point, the continually increasing values in L will always be greater than their mirr communication partners in R which keep getting smaller as we continue, so \((!min,!max):\#!mirr\) always swaps them. Since L and R were originally pre-sorted, the rest will be swapped as on both L and R the values are only getting farther on the wrong side of \(med\), triggering the swap by an even greater difference. Therefore all the values in L below \(med\) stay in L while those in R above \(med\) stay in R, and after the crossing point conversely, all the values in L above \(med\) move to R while those in R below \(med\) move to L. Hence after the mirr swap applies all the way across L and R, the above-median half ends in R and the below-median half ends in L. QED.
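A worked instance of this proof may help: the crossing index and the swap outcome can be computed directly. A Python sketch (the names mirr_minmax and crossing_index are mine):

```python
def mirr_minmax(L, R):
    """(!min,!max) : #!mirr: pair L[i] with R[n-1-i]; min left, max right."""
    n = len(L)
    L2, R2 = L[:], R[:]
    for i in range(n):
        a, b = L[i], R[n - 1 - i]
        L2[i], R2[n - 1 - i] = min(a, b), max(a, b)
    return L2, R2

def crossing_index(L, R):
    """First i at which L[i] > R[n-1-i]; returns n if they never cross."""
    n = len(L)
    for i in range(n):
        if L[i] > R[n - 1 - i]:
            return i
    return n

L, R = [1, 4, 6, 9], [2, 3, 5, 8]        # both presorted
k = crossing_index(L, R)                  # k = 2, since L[2]=6 > R[1]=3
L2, R2 = mirr_minmax(L, R)
# Below the crossing index L keeps its own (rising) values;
# from k on, falling values arrive from the far end of R.
assert L2 == L[:k] + [R[len(R) - 1 - i] for i in range(k, len(L))]
assert max(L2) <= min(R2)                 # split at the median (here med = 4.5)
```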

Lemma Two: pre = !nest applied recursively sorts the !mirr-swapped outputs within L and within R respectively, such that each is fully internally sorted. (Claude's Bitonic Split Lemma asserts that (!min,!max):#!corr on a bitonic sequence of length 2n produces two bitonic halves with max(left) <= min(right), which is nice.)

In this part we apply preadjust (!min,!max): #!corr to the largest vector pair from the current level of postadjust (!min,!max):#!mirr (call them L and R) and then separately to their halves, etc., recursively down to unit vectors again.

At each point we are given that L and R are already bitonically sorted, and the proof demonstrates that applying (!min,!max): #!corr to an already-bitonically sorted vector (splitting it in half and communicating between corresponding elements within the two halves, pairwise retaining the lower on the left and the higher on the right), yields two bitonically sorted subvectors, a new L and R, each of which has all its elements lower or higher than all the elements of the other. So not only are the values in each half the low and high halves of the values, but each of subvectors is now itself bitonic, thus meeting the requirements to apply the same procedure again. When the two halves are fully in order relative to each other, then halves of each half are in their own relative order, etc., then in the end everything is fully in order. Does it make sense?

Then we have a recursive, preadjustment, corresponding-element-swap, (that is proceeding from largest to smallest on outputs of the postadjustment mirror-element-swap). Given bitonic sequencing in the input, it achieves an ordering of the data halves into the half-sub-vectors, along with bitonic sequencing in each of the output half-sub-vectors, so it can be applied again.

So we guarantee a complete ordering of the whole L,R sequence, and are ready for the next expanded level of post-adjustment, mirror-element-swap. When that comes out to the top, we are fully sorted. That's it.

In the center of this argument is the bitonic vector to two bitonic vectors mapping, which preadjustment (!min,!max):#!corr is supposed to do.

Have I persuaded you of this part yet? Let R = [r_1 .. r_m] and F = [f_1 .. f_n] be rising and falling sequences. Append them into a single vector V' of length (m+n). Apply a circular shift of length \(c < m+n\) to yield a vector V. Then V is bitonically ordered. There are a few cases, depending on whether the cut falls at \(c=0\), within the rising part, or within the falling part, and on which side of \((m+n)/2\) it falls, yielding six combinations. I persuade myself of it by drawing rising or falling lines representing sequences of rising or falling values in a vector, doing some circular rotation of it, and seeing how the ends are themselves ordered, because they were adjacent before the rotation and therefore increasing or decreasing in that order. This fact lets us see that the mirror communication swap keeps everything that should be in the low half in L and everything that should be in the high half in R, and each of L and R is at worst, once again, bitonically ordered.
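The rotation claim can also be checked mechanically: build V' = R ++ F, rotate it by every c, and test bitonicity by trying each rotation against the NDNI shape. A Python sketch (names mine):

```python
def is_ndni(s):
    """Non-decreasing run followed by a non-increasing run (either may be empty)."""
    i, n = 0, len(s)
    while i + 1 < n and s[i] <= s[i + 1]:
        i += 1
    while i + 1 < n and s[i] >= s[i + 1]:
        i += 1
    return i + 1 >= n

rising, falling = [1, 3, 5, 8], [9, 6, 4, 2]   # R and F, as in the text
V_prime = rising + falling                      # NDNI by construction
for c in range(len(V_prime)):
    V = V_prime[c:] + V_prime[:c]               # circular shift by c
    # V is bitonic: some rotation of it (namely by -c) is NDNI.
    assert any(is_ndni(V[r:] + V[:r]) for r in range(len(V)))
```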


Recall that \(!\) means distribution, so \(!nest\) means separately apply \(nest\) to the left and to the right subvectors. Hence \(!nest\) does not sort between L and R, only within each.

\(nest\) is another Parallel Divide and Conquer algorithm called within \(sort\):

nest = PDC(\(d_{lr},c_{lr},\) (!min,!max):#!corr,id,atom?,id)

Is this familiar yet? It says (\(d_{lr}\)) split the input into left and right half-vectors, then (preadjust) immediately (\(\#!corr\)) communicate between corresponding elements of the two subvectors, \((!min,!max)\) swapping if the left value is higher than the right. Then split them again, and do correspondent communication and lower/higher swapping within each half-subvector.

The result of \((!min,!max):\#!corr\) is to cut up the fall-rise or rise-fall pattern inherited from the previous stage, whether of \((!min,!max):\#!mirr\) or of a previous \((!min,!max):\#!corr\), into as much as a rise-fall-rise or a fall-rise-fall, based on the new median value and where in the previous pattern it cuts. But the next round divides the subvectors in half yet again, simplifying the possible up-and-down patterning by half; and the median value \(med\) inherits from one level of \(\#!corr\) to the next, maintaining a certain relationship between extremes from the previous round, which is enough to guarantee that at each subdivision all the high values in the subvector pair go right and all the low values go left. This having been shown for any more-significant bit, \(nest\) recurses to the next less significant bit, and the constraints are inherited again. In this way I show that the full pre-adjustment PDC \(nest\), combing L and R from their own address space's high to low bits all the way to the bottom, sorts on each of those bits, ending with a fully sorted L, or again R.

Hence we step \(d\) from the smallest atomic subvectors (\(d=0\)) to the largest, half-\(V\) subvectors (\(d=D-1\)) doing \(mirr\) swaps to get things sorted with respect to bit \(b_d\), and then combing and more finely recombing with corr swaps between the largest subvectors which with a perfect tidiness pushes the median-structured value sequences into the correct left and right sides of increasingly-divided subvectors, all the way down to atoms again.

Note that the identical \(mirr\) swap operations would occur in a PDC with a preadjustment function using \(d_{eo},c_{eo}\) (so it operates on atoms in pairs, to start), or in a PDC with a postadjustment function using \(d_{lr},c_{lr}\) (so it operates not until atoms in pairs are available at the end of all the recursive divisions). George Mou has pointed out that

PDC(\(d_{lr},c_{lr},\)id,op,\(pred_b,f_b\)) = PDC(\(d_{eo},c_{eo}\),op,id,\(pred_b,f_b\))

or to put it in words, postadjust on left-right division/combination is the same as preadjust on even-odd division/combination. In both cases, the adjust operation op works first on pairs of related vectors of length 1, and works last on pairs of length \(|V|/2\). In the LR case, the atomic vectors appear after completely dividing the \(2^D\) values \(D\) times, and all the related subvector pairs are processed after the recursive center is completed (postadjust) while in the EO case the atomic vectors appear immediately upon the first even-odd division, and the adjust process applies after the recursive center is completed (preadjust). But the same operations occur on the same data, moving from small to large, atomic to maximal, in either approach. Hence if

\(monosort_{eo} = PDC(d_{eo},c_{eo},!nest:(!min,!max): \#!mirr,id,atom?,id) \)

Then it follows that

\(monosort_{lr} = monosort_{eo}\)

George reports that after presenting monotonic sort to his Yale peers and professors, Alan Perlis pulled him aside to ask him whether another version of a second-order PDC could be found with an alternative nesting relationship, such as \(mirr\) inside \(corr\), etc.

Given the Optimal Mappings paper, it seems clear to me that any permutation of bits should be able to be applied in a monotonic sort algorithm. The dividing bit \(d\) can range, instead of 0 to \(D-1\), from the 0'th to the \(D-1\)'th permuted bits, so long as permuted divide and permuted combine functions are also available.

Just as in hypercube communication, it seems natural that one should be able to divide on any bit -- any dimension -- and recombine later on that bit in the order specified by the given permutation of bits. If we consider the unsorted values according to their final-sorted ordinal address values, and consider any bit the lowest and any bit the highest according to some fixed permutation \(p\), then \(d_{p}, c_{p}\) will navigate in the permuted order, subdividing the binary hypercube corresponding to all the addresses in the input data vector. This requires we consider values or addresses as something more akin to linguistic elements than geometrical ones, and that instead of making smaller and larger groupings, which depends on the non-permutation of the address bits, we abandon our intuitions about what belongs together, and let the algorithm follow correct reasoning.

I'll bet a dollar that the values will shuffle and percolate across surprising distances but in the end will pop into final order no matter the permutation. To that end the terminology above was intended to enable this generalization.

Feedback is welcome.
Copyright © 2024 Thomas C. Veatch. All rights reserved.
Created: December 1, 2024