Mou's Statement

Mou's thesis, "4.3.5 Monotonic Sort - A Second Order PDC", pp. 89-90, drops the following statement while expanding on the possibilities of his Parallel-Divide-and-Conquer functions (nesting them, in particular):

"PDC's can not only be composed together as shown in Section 4.3.3, but can also be nested into other PDC's to compute more complicated problems. A higher order PDC is a PDC that uses [an]other PDC in its constituent(s). The number of nesting levels is called the order of the PDC. In this section we will show a second order sorting PDC algorithm."

I find this complete, but not elaborated, nor obvious, nor proven, although implementation will show that it in fact works.

Algorithm Statement

Allow me to elaborate, attempting to make this more understandable, even to prove it. It will help you to first recall the divacon computation scaffold.
Pre and Post Adjustment

Here sort is a post-adjustment operation: we divide to atoms, then combine recursively, doing the work on recombination. On the other hand nest is a pre-adjustment operation: we divide but only once, then do the work on the now-related half-vectors before dividing again. Work is done on division; when finally we reach atoms on division, all the work is done except for concatenating them all bottom up, in this case magically yielding a completely sorted output.
Post Adjust sort

So, having divided to atoms before doing anything (no preadjustment work happens in sort), we then begin the work. We have pairs of singleton vectors, each of which is already internally sorted (non-decreasing, for sure; how else can you sort one item? -- it is sorted with itself without doing anything). So the prerequisites are met. Given, then, a pair of vectors each internally sorted, we combine with \((!min,!max):\#!mirr\), which guarantees that the output left and right vectors are separated by, or overlap only at, their median: the strictly lower half of the data always ends up in the left vector and the higher half in the right. However the left and right outputs are not fully sorted. Indeed the left vector rises then falls; the right vector falls then rises; median values, if they occur, form the peak of the left vector and the trough of the right vector. Each could be considered a sawtooth, whether peak-to-peak (hence falling-rising) or valley-to-valley (hence rising-falling).
A Sawtooth or Bitonic Sequence

I want to capture and generalize this idea of a single sawtooth as a sequence: a sequence of values, starting anywhere on the rise or fall of it, and continuing to exactly one sawtooth-width away. Here we go.

Let a sequence be strictly NDNI, or loosely 'rising-falling', if it is a concatenation of two zero-or-more-length subsequences, the first being Non-Decreasing and the second being Non-Increasing.
Then the empty sequence, a singleton sequence, an increasing or non-decreasing sequence, and a decreasing or non-increasing sequence are all strictly NDNI, because any of the lengths could be zero. Let a sequence be 'bitonic' if it is a rotation of an NDNI sequence.
Fun so far. It will come in handy later. Hence a strictly NDNI sequence has the form

\(V = \{ e_1, e_2, \ldots, e_N \mid e_1 \le e_2 \le \cdots \le e_x \ge e_{x+1} \ge e_{x+2} \ge \cdots \ge e_N \}\)

Now let us say that a vector of length 1, a 1D vector, is both sorted and bitonic, because no ordering constraints from the definitions are violated. Next, let's have some language for vector ordering.
Given two vectors V_1, V_2, let's say V_1 and V_2 are in vector order, written \(V_1 \le V_2\), when \(\max(V_1) \le \min(V_2)\), that is, when every element of V_1 is at most every element of V_2. For example, [1,2,2] \(\le\) [2,3,4], since \(\max = 2 \le 2 = \min\).
Now we want to prove two results.
1) \((!min,!max) : \#!mirr\) will take two sorted vectors to two bitonic vectors in vector order.
In a way we trade away the internal ordering of the input sorted
vectors in return for vector ordering of the output vectors which
are no longer sorted, but still at least they are bitonic.
Then our job is to get the bitonic subvectors each back into sorted
order, because the top-level sort between the vectors, such that
\(V_L \le V_R\), has already been established.

2) \((!min,!max) : \#!corr\) will take a bitonic vector back to sorted order.

Actually it won't, but if you do this as a divacon preadjustment at
all the levels of division then combination, then will it? Note
that preadjustment with correspondent communication on \(d_{lr}\),
\(c_{lr}\) swaps paired elements between corresponding positions in the
largest (half) vectors first, then within quarters, etc., down to
singleton vectors.

That would be a nice trick. I'm still not sure it's true.
Here is Mou's Monotonic Sort algorithm, slightly reformatted:

\(sort = PDC(d_{lr},\ c_{lr},\ pre = id,\ post = !nest : (!min,!max) : \#!mirr)\)
\(nest = PDC(d_{lr},\ c_{lr},\ pre = (!min,!max) : \#!corr,\ post = id)\)

Inside the postmorphism PDC sort is a premorphism PDC, nest.
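Here is a minimal executable sketch in Python -- my own sequential modeling of the Divacon primitives, so the names (mirr_minmax, corr_minmax) and the list representation are assumptions, not Mou's notation:

```python
# A sequential model of the second-order PDC. The names and the
# list representation are mine; recursion stands in for parallelism.

def mirr_minmax(L, R):
    # (!min,!max) : #!mirr -- mirror communication: L[i] partners with
    # R[n-1-i]; the min stays left, the max stays right.
    n = len(L)
    L2, R2 = L[:], R[:]
    for i in range(n):
        a, b = L[i], R[n - 1 - i]
        L2[i], R2[n - 1 - i] = min(a, b), max(a, b)
    return L2, R2

def corr_minmax(L, R):
    # (!min,!max) : #!corr -- correspondent communication: L[i] with R[i].
    return ([min(a, b) for a, b in zip(L, R)],
            [max(a, b) for a, b in zip(L, R)])

def nest(V):
    # Premorphism (preadjust) PDC: adjust on division, largest halves first.
    if len(V) <= 1:
        return V
    h = len(V) // 2
    L, R = corr_minmax(V[:h], V[h:])   # divide d_lr, then preadjust
    return nest(L) + nest(R)           # recurse; combine c_lr concatenates

def sort(V):
    # Postmorphism (postadjust) PDC: divide to atoms, work on recombination.
    if len(V) <= 1:
        return V
    h = len(V) // 2
    L, R = sort(V[:h]), sort(V[h:])    # divide d_lr, recurse to atoms
    L, R = mirr_minmax(L, R)           # postadjust: (!min,!max) : #!mirr ...
    return nest(L) + nest(R)           # ... then !nest on each half; c_lr

print(sort([3, 7, 0, 6, 1, 5, 2, 4]))  # -> [0, 1, 2, 3, 4, 5, 6, 7]
```

Every loop shown would be a single parallel communicate-and-swap step on a parallel machine.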
How does this miracle work?
See how 7 travels: first it completes a local sort within the low
address bit at the \(|V| = 1\) mirror; then the next higher address
bit is made right at the \(|V| = 2\) mirror, but with 7 at the bottom of its
order, just so that nest can clear up the low bit again at the \(|V|=1\) corr.
In this way monotonic sort clears the low-bit sort task, then pushes
higher-half values into the higher half but in reverse order: those
travellers arrive reverse-sorted in the higher half, and then nest's corr
moves each by the largest bit of its address first (since nest is a
preadjust), then by successively lower bits, down to the lowest bit
within the higher half.
You have the idea of how it flows. Now let's see why it cannot fail.
sort is a postadjustment PDC: divide V into atoms, then on recombination
at each level d (d=0 to D-1), apply !nest : (!min,!max) : #!mirr to each
pair of related subvectors of size \(n = 2^d\) before combining them.
Base case (d=0): Atoms are trivially sorted. The first postadjustment
combines pairs of atoms. \((!min,!max):\#!mirr\) on singletons is just a
compare-swap. !nest on atoms is identity. Result: [min, max]. Sorted.
Inductive step: Given sorted \(L = [l_0 \le \ldots \le l_{n-1}]\) and
sorted \(R = [r_0 \le \ldots \le r_{n-1}]\), show that
\((!min,!max) : \#!mirr\) places all values below the median in L'
and all values above it in R' (the Sorted Split Lemma, Lemma 1).
We find trivially that L' and R' are sawtooth sequences (bitonic).
Then nesting another PDC (preadjust, compare-swap between
corresponding elements) recursively divides each into half-vectors.
The compare-swap across the halves of a single sawtooth yields two
half-vector sawtooths L and R with \(\max(L) \le \min(R)\) (Lemma
2, the Bitonic Split Lemma).
Given the Bitonic Split for a \(2^d\)-length vector, the low and high
halves of the data are forced into the left and right \(2^{d-1}\)-length
vectors. Recursively, the whole sequence ends fully sorted.
Since the top level emitted from the mirr step was already sorted
by halves into L and R, and now each half is recursively sorted by
halves down to atoms, L++R is fully sorted.
And since the outer PDC applying mirror compare-swap proceeds from
atoms to the full data vector, the last level yields a full sort on
the entire input.
That's the idea. Now to believe me you need to hear the Sorted Split Lemma
and the Bitonic Split Lemma.
Lemma 1 (Sorted Split Lemma). Given sorted L and R, each of length n,
\((!min,!max) : \#!mirr\) places all values below the median in L' and
all values above the median in R'.

Operation: Mirror communication pairs L[i] with R[n-1-i], keeping the
min on the left and the max on the right:

\(L'[i] = \min(L[i], R[n-1-i]), \qquad R'[n-1-i] = \max(L[i], R[n-1-i])\)

Proof: As i increases from 0 to n-1, L[i] increases (L sorted) and
R[n-1-i] decreases (R sorted, index decreasing).
So the difference R[n-1-i] - L[i] is monotonically non-increasing.

First the clear case; then, the exception. Clearly, if there is no
occurrence of a median value in either L or R, then while counting from
opposite ends the values must at some point cross each other. Then
there exists a crossing index k (0 ≤ k ≤ n) such that:

R[n-1-i] > L[i] for i < k, and R[n-1-i] < L[i] for i ≥ k.

After the operation, L'[i] = L[i] and R'[n-1-i] = R[n-1-i] for i < k,
while L'[i] = R[n-1-i] and R'[n-1-i] = L[i] (a swap) for i ≥ k.
Do you see? The pairwise comparisons and swaps leave the elements of L
before the crossing index in L': before this median point they
remain non-decreasing and stay in L. Then the elements from R, counting
backwards from the end, past the crossing index are less than the
median value and get swapped into L'.
But now the median value separates L' and R', and indeed max(L') < min(R').
In the output vectors there are two sections:
increasing-then-decreasing in L', and decreasing-then-increasing in R'.
(Therefore, Corollary: both L' and R' are bitonic. We will use this below.)
The conclusion, a top-level sort between L' and R', stands: all values
below the median are in L' and all above the median are in R'.
Now the exception. In case median values occur in L or R, whether
swapped or not, and whether each ends up in L' or R', they form extreme
values (the max in L', the min in R'), changing the conclusion only from
strict, max(L') < min(R'), to max(L') ≤ min(R').
QED Lemma 1.
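As a quick empirical sanity check of Lemma 1, here is a sketch under the same list-based modeling as above (mirr_minmax is repeated so the snippet stands alone); duplicates in the 0..9 range exercise the median-value exception:

```python
# Check the Sorted Split Lemma on random sorted inputs: after the mirror
# compare-swap, every element of L' is <= every element of R'.
import random

def mirr_minmax(L, R):
    n = len(L)
    L2, R2 = L[:], R[:]
    for i in range(n):
        a, b = L[i], R[n - 1 - i]
        L2[i], R2[n - 1 - i] = min(a, b), max(a, b)
    return L2, R2

for _ in range(1000):
    n = random.randint(1, 32)
    L = sorted(random.randint(0, 9) for _ in range(n))
    R = sorted(random.randint(0, 9) for _ in range(n))
    L2, R2 = mirr_minmax(L, R)
    assert max(L2) <= min(R2)   # vector order: L' <= R', split at the median
print("Lemma 1 checks out")
```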
Now for nest and Lemma 2. nest is a preadjustment PDC. On input B of
length 2n, its corr step computes, for each 0 ≤ i < n:

\(L[i] = \min(B[i], B[i+n]), \qquad R[i] = \max(B[i], B[i+n])\)

and then recurses within L and within R. The claim is that nest sorts
any bitonic input.
Proof by strong induction on n.
Base case (n=1): Input is [a, b], length 2. The corr step compares
them and produces [min(a,b), max(a,b)]. Division yields atoms; recursion
is identity. Result: sorted.
Inductive step: Assume nest sorts bitonic sequences of length < 2n.
We need the Bitonic Split Lemma: if B (of length 2n) is bitonic, then
L and R as computed above are each bitonic, and max(L) ≤ min(R).
If this holds, then by induction nest(L) is sorted, nest(R) is sorted,
and max(L) ≤ min(R), so L++R is sorted.
Proof of the Bitonic Split Lemma. Since the corr step is a comparator
network, by the 0-1 principle it suffices to check 0-1 sequences.
A 0-1 bitonic sequence has at most two transitions (block boundaries)
when viewed circularly. Every 0-1 bitonic sequence of length 2n has one
of these forms:

\(0^a 1^b 0^c\) or \(1^a 0^b 1^c\), with a + b + c = 2n, any of a, b, c
possibly zero.

For 0-1 inputs, the corr operation computes min as AND and max as OR:
L[i] = 1 requires B[i] = 1 AND B[i+n] = 1, while
R[j] = 0 requires B[j] = 0 AND B[j+n] = 0.
Case \(0^a 1^b 0^c\):
The 1-positions are the interval [a, a+b).
L[i] = 1 requires both i ∈ [a, a+b) and i+n ∈ [a, a+b),
so i ∈ [a, a+b-n). This is nonempty only when b > n.
R[j] = 0 requires both j ∉ [a, a+b) and j+n ∉ [a, a+b).
When b > n, the 0-positions total 2n - b < n, and they form at most two
blocks. Two positions j and j+n, exactly n apart, cannot both lie among
them.
(If j and j+n straddle [a, a+b), with j < a and j+n ≥ a+b, that
requires n ≥ b, contradicting b > n. Both in the leading 0s requires
j+n < a, impossible since a < n when b > n and a+b+c = 2n. Both
in the trailing 0s requires j ≥ a+b, and then j+n ≥ a+b+n > 2n, out of
range.) So R is all 1s, and max(L) ≤ min(R) holds.
When b ≤ n: L[i] = 1 is impossible (no i has both i and i+n inside
[a, a+b), an interval of width ≤ n). So L is all 0s, and
max(L) ≤ min(R) holds again.
Case \(1^a 0^b 1^c\): Symmetric argument.
Therefore max(L) ≤ min(R) for all 0-1 bitonic sequences.
By the 0-1 principle, max(L) ≤ min(R) for all bitonic sequences: min
and max commute with any monotone map of the values, so if some input
had max(L) > min(R), thresholding it at min(R) would give a 0-1
counterexample.
Now for bitonicity of L and R. The 1-positions of B, viewed as a subset
of {0, ..., 2n-1}, form at most two intervals. Define:

S1 = { i ∈ [0,n) : B[i] = 1 },  S2 = { i ∈ [0,n) : B[i+n] = 1 }.

Each of S1 and S2 is the intersection of the 1-positions of B with
[0,n) or [n,2n) respectively (the latter shifted down by n). Since B's
1-positions form at most two intervals in [0,2n), each of S1 and S2 is
a union of at most two intervals in [0,n).
L's 1-positions = S1 ∩ S2: the intersection of at most two intervals
with at most two intervals, which here comes to at most two intervals.
R's 1-positions = S1 ∪ S2: a union of at most two intervals with at
most two intervals, so at most four intervals in general. But we need
at most two for bitonicity.
The tighter bound on R follows from the constraint that S1 and S2
derive from a single bitonic sequence: the 1-positions of B form at
most two contiguous arcs on the circle {0,...,2n-1}. Projecting onto
[0,n) by collapsing i and i+n (via OR), the image has at most two
intervals. (Proof: if the 1-positions are a single arc, the OR-projection
onto [0,n) is at most two intervals, since the arc may wrap once. If two
arcs, they are complementary parts of the circle, and the projection
again yields at most two intervals.)
Hence L and R are 0-1 sequences with at most two block boundaries:
bitonic. By the 0-1 principle, for general sequences, L and R are
bitonic.
This completes the Bitonic Split Lemma, hence Lemma 2, hence the proof. ▮
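Here is a matching empirical check of the Bitonic Split Lemma, a sketch under the same list-based modeling; random_bitonic and is_bitonic are my own helper names, building bitonic inputs as rotations of NDNI sequences per the definition above:

```python
# Check the Bitonic Split Lemma: the corr compare-swap sends a bitonic B
# to bitonic halves L, R with max(L) <= min(R).
import random

def is_bitonic(v):
    # Bitonic iff, around the circle, the nonzero consecutive differences
    # change sign at most twice.
    n = len(v)
    signs = [d for d in (v[(i + 1) % n] - v[i] for i in range(n)) if d != 0]
    return sum(1 for i in range(len(signs))
               if signs[i] * signs[(i + 1) % len(signs)] < 0) <= 2

def random_bitonic(length):
    vals = [random.randint(0, 9) for _ in range(length)]
    x = random.randint(0, length)
    v = sorted(vals[:x]) + sorted(vals[x:], reverse=True)  # NDNI shape
    c = random.randrange(length)
    return v[c:] + v[:c]                                   # rotate: bitonic

for _ in range(1000):
    n = random.randint(1, 16)
    B = random_bitonic(2 * n)
    L = [min(B[i], B[i + n]) for i in range(n)]
    R = [max(B[i], B[i + n]) for i in range(n)]
    assert max(L) <= min(R)              # vector order between the halves
    assert is_bitonic(L) and is_bitonic(R)
print("Lemma 2 checks out")
```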
Complexity: \(O(\log^2 N)\) parallel time, \(O(N \log^2 N)\) total comparisons. ▮
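To see where those counts come from, take \(N = 2^D\). The postadjustment at level \(d\) costs one parallel mirr step plus \(d\) parallel corr steps for \(!nest\) on each size-\(2^d\) subvector, so the parallel time is

\[ T(N) = \sum_{d=0}^{D-1} (d+1) = \frac{D(D+1)}{2} = O(\log^2 N), \]

and since each parallel step performs \(N/2\) independent comparisons, the total work is \(\frac{N}{2} \cdot \frac{D(D+1)}{2} = O(N \log^2 N)\) comparisons.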
Next, let there be a divide operation independent of vector values
(hence "polymorphic") that takes a vector of length |V| to two
subvectors of length |V|/2. I find it helpful to refer to these as
related subvectors; in divacon a combine operation will
operate on two related subvectors to make a double-length vector.
Then the address range of \(V\) comprises \(D\) bits, and its ordinal
addresses are numbers \(0\) to \(2^{D}-1\). In binary encoding an
address is a set of bits \(B = \{ b_{D-1}, b_{D-2}, ... b_2, b_1, b_0 \}\)
from most to least significant (leading zeroes retained explicitly).
Such an address range may have its bits renumbered by any numbering
system, such as from \(0\) to \(D-1\) interpreted as from least
significant to most significant, that is, low to high. High to low is
another permutation, but note that any mapping between \(0..D-1\) and
the various \(D\) bits, represented as a permutation of the bits \(B\), can
potentially be used to organize the same operations, as George Mou
showed in his Optimal Mappings paper.
The \(D\) bits of an address may be divided on the \(d\)'th bit in the
numbering system, \(b_d \in \{0,1\}\). Suppose \(d+e=D\) and we are
counting bits from least to most significant. If we know \(d\), then
that tells us where we are in a divide and conquer computation graph,
in which, generally speaking, two subvectors of some (same, power of
2) size are being divided from or combined into one double-size one,
and this is happening with some number of subvector pairs. The depth
\(d\) specifies: the size \(2^d\) of each subvector, the bit \(b_d\)
that separates the left subvector of a pair from the right, and the
\(e-1\) bits above \(b_d\) that select among the \(2^{e-1}\) pairs.
To illustrate, suppose 16 values in an
input vector are at a certain point in a parallel divide and conquer
algorithm divided into pairs of 4-vectors (2 such pairs will be seen).
Then \(D=4\) (since \(2^4=16\)). The dividing bit is \(b_2\) (so \(d=2\)), leaving on
one side bits \(0..d-1\) for subvector-internal addresses, in this case 00
01 10 11. On the other, more-significant side of the \(d\)'th bit are
bits \(d+1..D-1\), which are used to select a pair of subvectors; in this
case \(d+1=D-1=3\), so \(b_3 = 0\) selects the first pair of 4-vectors, and
\(b_3 = 1\) selects the last pair of 4-vectors.
Restating: if we are given a \(D\)-bit address range and a dividing bit
\(d < D\), then each subvector of a pair contains values whose
addresses (positions within the subvector) range over the low bits
\(0..d-1\).
\(b_d\), the dividing bit, having value 0 selects the first (left)
4-vector in a pair, and value 1 selects the second (right) 4-vector in
a pair.
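In code, the address decomposition might look like this (a sketch; split_address is a hypothetical helper name, not part of divacon):

```python
# Split a D-bit address into (pair-selecting high bits, the dividing
# bit b_d, the d subvector-internal low bits).
def split_address(addr, D, d):
    low = addr & ((1 << d) - 1)     # bits 0..d-1: position inside subvector
    b_d = (addr >> d) & 1           # the dividing bit: 0 = left, 1 = right
    high = addr >> (d + 1)          # bits d+1..D-1: which pair of subvectors
    return high, b_d, low

# The example above: D=4, d=2. Address 11 = 0b1011 sits in pair 1
# (b_3 = 1), in the left subvector (b_2 = 0), at internal position 3.
print(split_address(0b1011, 4, 2))  # -> (1, 0, 3)
```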
First, we divide the input vector \(V\) recursively in half until
everything is an atomic vector \(|V|=1\). We may say that every atomic
vector is fully sorted within itself. At this point, \(d=0\), and each
subvector has no internal address space, and \(b_0\) specifies the left or
right (atomic) subvector in the recursive algorithm.
Now, we set up our conditions of recursion: we are given a pair of
internally presorted \(2^d\)-length vectors.
Apply the divacon expression \((!min, !max) : \#!mirr\). Here \(\#!mirr\)
sets up mirror communication: element \(i\) of the left vector partners
with element \(n-1-i\) of the right.
Hence \((!min,!max) : \#!mirr\) exchanges the values between
communication partners if the right one was smaller than the left one,
and leaves them unchanged if the left is equal to or smaller than the
right.
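A worked instance (my own numbers): with \(L = [1,4,6,7]\) and \(R = [2,3,5,8]\), mirr pairs \((1,8), (4,5), (6,3), (7,2)\); keeping mins left and maxes right gives \(L' = [1,4,3,2]\) and \(R' = [7,6,5,8]\). Note \(\max(L') = 4 \le 5 = \min(R')\), \(L'\) rises then falls, and \(R'\) falls then rises.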
Another bit of terminology. Given a finite collection of values of even
size \(|V|\), there must exist a median value \(med\) (not necessarily
unique) such that half the values are above or equal to \(med\) and the
rest are below or equal to \(med\). For example, in \(\{1,2,2,3\}\),
\(med = 2\).

Now there are two lemmas that need to be proved to verify this recursion.
First we join two presorted subvectors L and R in a way which sorts
all the below-median values into L and all the above-median values
into R. If everything is placed into the correct half, then in a sense
we have sorted everything we have access to (the two subvectors being
worked on together) according to their index's highest bit, the
dividing bit. There may be chaos within L or within R, but from the
dividing bit perspective, everything in the two subvectors is in the
correct subvector, the lower values in the left and the higher values
in the right.
Secondly, we internally comb L and R separately in such a way that
they become internally fully sorted.
The proposition of the monosort algorithm is that
\(post=(!min,!max):\#!mirr\) accomplishes the first, and that
\(pre=!nest\) accomplishes the second. If so, then applying
\((!min,!max):\#!mirr\) will subsort at the dividing bit, inducing
some limited chaos in the new L and R, while \(!nest\) will subsort
from there so that all the less-significant bits are in order. (The
limited chaos is that each of L and R is not random but bitonic.)
Once sorted into lower and upper half-vectors L and R, and then
subsorted within L and within R, we can conjoin L and R to make a
fully-sorted, double-length vector. This satisfies the recursion
requirements for the next level \(d+1\), and hence we can repeat for
\(d=0\) to \(d=D-1\) and we are done when the highest-level pair of
subvectors are between- and within-sorted, then joined, yielding the
final result.
So let's prove these lemmas.
First, we are given equal-length subvectors L and R which are
presorted. Then the count of values in L and R together is an even
number, so the median \(med\) exists (defined as halfway between the
two middlemost points).
Lemma One: \((!min, !max) : \#!mirr\) places all the values below med into the
left subvector and all values above \(med\) into the right subvector.
At each point we are given that L and R are already bitonically
ordered, and the proof demonstrates that applying \((!min,!max):
\#!corr\) to an already-bitonic vector (splitting it in half
and communicating between corresponding elements of the two
halves, pairwise retaining the lower on the left and the higher on the
right) yields two bitonically ordered subvectors, a new L and R, each
of which has all its elements lower or higher than all the elements of
the other. So not only are the values in the two halves the low and
high halves of the data, but each of the subvectors is now itself
bitonic, thus meeting the requirements to apply the same procedure
again. When the two halves are fully in order relative to each other,
then the halves of each half are in their own relative order, etc.,
and in the end everything is fully in order. Does it make sense?
Then we have a recursive, preadjustment, corresponding-element swap
(that is, proceeding from largest subvectors to smallest, on outputs of
the postadjustment mirror-element swap). Given bitonic sequencing in
the input, it achieves an ordering of the data halves into the
half-sub-vectors, along with bitonic sequencing in each of the output
half-sub-vectors, so it can be applied again.
So we guarantee a complete ordering of the whole L,R sequence, and are
ready for the next expanded level of post-adjustment,
mirror-element-swap. When that comes out to the top, we are fully
sorted. That's it.
In the center of this argument is the bitonic vector to two bitonic
vectors mapping, which preadjustment (!min,!max):#!corr is supposed to
do.
Have I persuaded you of this part, yet?
Let R = [r_1 .. r_m] and F = [f_1 .. f_n] be rising and falling
sequences. Append them into a single vector V' of length (m+n).
Apply a circular shift of length \(c < m+n\) to yield a vector V.
Then V is bitonically ordered. There are a few cases: c = 0 leaves
V = V', NDNI itself; a shift whose cut falls within the rising part R
yields a rise-fall-rise shape; a shift whose cut falls within the
falling part F yields a fall-rise-fall shape. In every case V is by
construction a rotation of an NDNI sequence, hence bitonic. For
example, rotating V' = [1,3,4,2] by two gives [4,2,1,3], falling then
rising.
\(nest\) is another Parallel Divide and Conquer algorithm, called within \(sort\):

\(nest = PDC(d_{lr},\ c_{lr},\ pre = (!min,!max) : \#!corr,\ post = id)\)
Is this familiar yet? It says (\(d_{lr}\)) split the input into left
and right half-vectors, then (preadjust) immediately (\(\#!corr\))
communicate between corresponding elements of the two subvectors,
(\(!min,!max\)) swapping if the left value is higher than the right.
Then split them again, and do correspondent communication and
lower/higher swapping within each half-subvector.
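For instance (my own numbers): take the bitonic \([1,3,4,2]\). The corr step pairs \((1,4)\) and \((3,2)\), yielding \(L = [1,2]\) and \(R = [4,3]\); recursing within each half gives \([1,2]\) and \([3,4]\); concatenated, \([1,2,3,4]\) is fully sorted.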
The result of \((!min,!max):\#!corr\) is to cut up the fall-rise or
rise-fall pattern inherited from the previous stage, whether of
\((!min,!max):\#!mirr\) or of a previous \((!min,!max):\#!corr\), into as much
as a rise-fall-rise or a fall-rise-fall, based on the new median value
and on where it cuts the previous pattern. But the next round divides
the subvectors in half yet again, simplifying the possible up-and-down
patterning by half; and the \(med\) median value inherited from one
level of \(\#!corr\) to the next maintains a certain relationship
between extremes from the previous round, which is enough to guarantee
that at each subdivision all the high values in the subvector pair go
right and all the low values go left. This having been shown for any
more-significant bit, \(nest\) recurses to the next less-significant
bit, and the constraints are inherited again. In this way I show that
the full pre-adjustment PDC \(nest\), combing L and R from their own
address space's high bits to low all the way to the bottom, sorts on
each of those bits, ending with a fully sorted L, or again R.
Note that the identical \(mirr\) swap operations would occur in a PDC with
a preadjustment function using \(d_{eo}, c_{eo}\) (so it operates on atoms
in pairs, to start), or in a PDC postadjustment function using
\(d_{lr}, c_{lr}\) (so it operates not until atoms in pairs are available
at the end of all the recursive divisions). George Mou has pointed out that

\(PDC(d_{lr},\ c_{lr},\ pre = id,\ post = op) = PDC(d_{eo},\ c_{eo},\ pre = op,\ post = id)\)
or to put it in words, postadjust on left-right division/combination
is the same as preadjust on even-odd division/combination. In both
cases, the adjust operation op works first on pairs of related
vectors of length 1, and works last on pairs of length \(|V|/2\). In
the LR case, the atomic vectors appear after completely dividing the
\(2^D\) values \(D\) times, and all the related subvector pairs are
processed after the recursive center is completed (postadjust) while
in the EO case the atomic vectors appear immediately upon the first
even-odd division, and the adjust process applies after the recursive
center is completed (preadjust). But the same operations occur on the
same data, moving from small to large, atomic to maximal, in either
approach. Hence if the postadjust left-right form computes a sort,
then it follows that the preadjust even-odd form computes the very
same sort.
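A small empirical check of this duality under my own formulation (corr compare-swap as the adjust op; each log records every compare-swap performed):

```python
# Postadjust on left-right division vs. preadjust on even-odd division:
# the same compare-swaps (level for level) and the same result.
def post_lr(v, log):
    if len(v) <= 1:
        return v
    h = len(v) // 2
    L, R = post_lr(v[:h], log), post_lr(v[h:], log)
    pairs = [(min(a, b), max(a, b)) for a, b in zip(L, R)]
    log.extend(pairs)                     # postadjust: corr compare-swap
    return [lo for lo, hi in pairs] + [hi for lo, hi in pairs]

def pre_eo(v, log):
    if len(v) <= 1:
        return v
    E, O = v[0::2], v[1::2]               # even-odd division: atoms in pairs
    pairs = [(min(a, b), max(a, b)) for a, b in zip(E, O)]
    log.extend(pairs)                     # preadjust: corr compare-swap
    lo = pre_eo([l for l, h in pairs], log)
    hi = pre_eo([h for l, h in pairs], log)
    out = [None] * len(v)
    out[0::2], out[1::2] = lo, hi         # even-odd combine interleaves
    return out

log1, log2 = [], []
v = [3, 1, 4, 2, 9, 5, 0, 7]
assert post_lr(v, log1) == pre_eo(v, log2)   # identical output vector
assert sorted(log1) == sorted(log2)          # identical multiset of swaps
```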
George reports that after presenting monotonic sort to his Yale peers
and professors, Alan Perlis pulled him aside to ask him whether
another version of a second-order PDC could be found with an
alternative nesting relationship, such as \(mirr\) inside \(corr\), etc.
Given the Optimal Mappings paper, it seems clear to me that any permutation of bits
could be applied in a monotonic sort algorithm. The dividing bit \(d\)
can range, instead of from 0 to \(D-1\), from the 0'th to the \(D-1\)'th permuted bit,
so long as permuted divide and permuted combine functions are also available.
Just as in hypercube communication, it seems natural that one should
be able to divide on any bit -- any dimension -- and recombine
later on that bit in the order specified by the given permutation of
bits. If we consider the unsorted values according to their
final-sorted ordinal address values, and consider any bit the lowest
and any bit the highest according to some fixed permutation \(p\), then
\(d_{p}, c_{p}\) will navigate in the permuted order, subdividing the
binary hypercube corresponding to all the addresses in the input data
vector. This requires we consider values or addresses as something
more akin to linguistic elements than geometrical ones, and that,
instead of making smaller and larger groupings, which depends on the
non-permutation of the address bits, we abandon our intuitions about
what belongs together, and let the algorithm follow correct reasoning.
I'll bet a dollar that the values will shuffle and percolate across
surprising distances but in the end will pop into final order no
matter the permutation. To that end the terminology above was intended
to enable this generalization.