Here is Mou's Monotonic Sort algorithm, slightly reformatted:
sort = PDC(\(d_{lr}, c_{lr}\), id, !nest : (!min,!max) : #!mirr, atom?, id)
nest = PDC(\(d_{lr}, c_{lr}\), (!min,!max) : #!corr, id, atom?, id)
Inside the postmorphism PDC sort is a premorphism PDC, nest.
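For readers who like running code, here is a minimal Python sketch of the PDC combinator itself; the encoding of a vector as a list, and of an adjust as a function on a pair of related subvectors, is mine, not Mou's.

```python
# A minimal sketch of PDC(divide, combine, preadjust, postadjust,
# basepred, basefunc). Vectors are Python lists; an adjust acts on a
# pair of related subvectors and returns the adjusted pair.

def pdc(divide, combine, preadjust, postadjust, base_pred, base_fn):
    def run(v):
        if base_pred(v):
            return base_fn(v)
        left, right = divide(v)
        left, right = preadjust(left, right)    # premorphism slot
        left, right = run(left), run(right)     # recursive center
        left, right = postadjust(left, right)   # postmorphism slot
        return combine(left, right)
    return run
```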
How does this miracle work?
Look, see, get some intuition.
It's fun and informative to watch the data flow as follows in an 8-item
sort, where |V| is the working size of a data vector. Values are shown
at the beginning, at the end, and upon a change; otherwise not.
| Value: | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
| Address in V: | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
| Event Outcomes | Function | Level | Event(s) |
| [7 6 5 4 3 2 1 0] | sort | |V|=8 | begin |
| [7] [6] [5] [4] [3] [2] [1] [0] | sort | |V|=1 | \(d_{lr}\) 3x |
| [6] [7] [4] [5] [2] [3] [0] [1] | sort | |V|=1 | (!min,!max) : #!mirr (post) |
| [6 7] [4 5] [2 3] [0 1] | sort | |V|=2 | \(c_{lr}\) |
| [5 4] [7 6] [1 0] [3 2] | sort | |V|=2 | (!min,!max) : #!mirr (post) |
| [4] [5] [6] [7] [0] [1] [2] [3] | nest | |V|=1 | (!min,!max) : #!corr : \(d_{lr}\) (pre) |
| [4 5] [6 7] [0 1] [2 3] | nest | |V|=2 | \(c_{lr}\), done in nest |
| [4 5 6 7] [0 1 2 3] | sort | |V|=4 | \(c_{lr}\) |
| [3 2 1 0] [7 6 5 4] | sort | |V|=4 | (!min,!max) : #!mirr (post) |
| [1 0] [3 2] [5 4] [7 6] | nest | |V|=2 | (!min,!max) : #!corr : \(d_{lr}\) (pre) |
| [0] [1] [2] [3] [4] [5] [6] [7] | nest | |V|=1 | (!min,!max) : #!corr : \(d_{lr}\) (pre) |
| [0 1 2 3] [4 5 6 7] | nest | |V|=4 | \(c_{lr}\) 2x, nest done |
| [0 1 2 3 4 5 6 7] | sort | |V|=8 | \(c_{lr}\) 1x, sort done |
See how 7 travels: it first completes a local sort within the low
address bit at the \(|V|=1\) mirror; then the next-higher address bit
is put right at the \(|V|=2\) mirror, but with 7 at the bottom of its
local order, just so that nest can clean up the low bit again at the
\(|V|=1\) corr. In this way monotonic sort settles the low-bit sort
task, then pushes higher-half values into the higher half but in
reverse order: those travellers arrive reverse-sorted in the higher
half, and nest's corr then moves them by changing the largest bit of
their subvector addresses first (since it is a preadjustment), working
down to the lowest bit within the higher half.
You have the idea of how it flows. Now let's see why it cannot fail.
Technical Explanation and Proof of Correctness
Let's define some terms. First let the input vector be of length
\(|V|=2^D\) for some nonnegative integer \(D\). For instance if there are
\(1024=2^{10}\) values, then \(D=10\).
Second, let there be a divide operation independent of vector values
(hence "polymorphic") that takes a vector of length |V| to two
subvectors of length |V|/2. I find it helpful to refer to these as
related subvectors; in divacon a combine operation will
operate on two related subvectors to make a double-length vector.
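As a sketch of what such a polymorphic divide and combine can look like, here are \(d_{lr}\) and \(c_{lr}\) under the list encoding introduced above:

```python
# d_lr / c_lr as Python functions, assuming a vector is a list of
# length 2^D. They are value-independent ("polymorphic"): they look
# only at positions, never at the data.

def d_lr(v):
    """Divide a vector into its left and right halves."""
    mid = len(v) // 2
    return v[:mid], v[mid:]

def c_lr(left, right):
    """Combine two related subvectors into one double-length vector."""
    return left + right
```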
Then the address range of \(V\) comprises \(D\) bits, and its ordinal
addresses are numbers \(0\) to \(2^{D}-1\). In binary encoding an
address is a set of bits \(B = \{ b_{D-1}, b_{D-2}, \ldots, b_2, b_1, b_0 \}\),
listed from most to least significant (leading zeroes retained explicitly).
Such an address range may have its bits permuted under any numbering
system, for example from \(0\) to \(D-1\) interpreted as least
significant to most significant, that is, low to high. High to low is
another permutation, and indeed any mapping between \(0..D-1\) and the
\(D\) bits, represented as a permutation of the bits \(B\), can
potentially be used to organize the same operations, as George Mou
showed in his Optimal Mappings paper.
The \(D\) bits of an address may be divided on the \(d\)'th bit in the
numbering system, \(b_d \in \{0,1\}\). Suppose \(d+e=D\) (so \(e=D-d\)
counts the dividing bit together with the bits above it) and we are
counting bits from least to most significant. If we know \(d\), then
that tells us where we are in a divide-and-conquer computation graph,
in which, generally speaking, two subvectors of the same power-of-2
size are being divided from or combined into one double-size vector,
and this is happening across some number of subvector pairs. The depth
\(d\) specifies:
- by the \(D-d-1\) more-significant bits, which pair of subvectors is being looked at;
- by the \(d\) less-significant bits, the address range within either subvector;
- by the \(1\) dividing bit \(b_d\), which subvector of the pair is being considered.
To illustrate, suppose 16 values in an input vector are, at a certain
point in a parallel divide-and-conquer algorithm, divided into pairs of
4-vectors (2 such pairs will be seen). Then \(D=4\) (since \(2^4=16\)).
The dividing bit is \(b_2\) (so \(d=2\)), leaving on one side bits
\(0..d-1\) for subvector-internal addresses, in this case 00 01 10 11.
On the other, more-significant side of the \(d\)'th bit are bits
\(d+1..D-1\), which select a pair of subvectors; in this case
\(d+1=D-1=3\), so \(b_3 = 0\) selects the first pair of 4-vectors and
\(b_3 = 1\) selects the second pair.
Restating: if we are given a \(D\)-bit address range and a dividing bit
\(d < D\), then each subvector of a pair contains values whose
addresses (positions within the subvector) range over the \(d\) bits
\(0..d-1\). The dividing bit \(b_d\) with value 0 selects the first
(left) subvector of a pair, and with value 1 the second (right)
subvector.
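A short sketch makes the bit arithmetic concrete (the helper name and encoding are mine):

```python
# At dividing bit d, a D-bit address factors into the pair selector,
# the dividing bit, and the subvector-internal offset.

def split_address(a, d, D):
    assert 0 <= a < (1 << D)
    internal = a & ((1 << d) - 1)   # bits 0..d-1: offset within the subvector
    dividing = (a >> d) & 1         # bit d: left (0) or right (1) subvector
    pair = a >> (d + 1)             # bits d+1..D-1: which pair of subvectors
    return pair, dividing, internal

# The D=4, d=2 example: address 0b1011 is in the second pair (b3=1),
# in the left 4-vector (b2=0), at internal offset 3 (b1 b0 = 11).
assert split_address(0b1011, 2, 4) == (1, 0, 3)
```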
Now we apply the divacon structure to our problem.
First, we divide the input vector \(V\) recursively in half until
everything is an atomic vector \(|V|=1\). We may say that every atomic
vector is fully sorted within itself. At this point, \(d=0\), and each
subvector has no internal address space, and \(b_0\) specifies the left or
right (atomic) subvector in the recursive algorithm.
Now, we set up our conditions of recursion: we are given a pair of
internally presorted \(2^d\)-length vectors.
Apply the divacon expression, \((!min, !max) : \#!mirr\).
- \(mirr\) mirrors addresses. Given subvectors of length \(2^d\),
any address \(a\) is taken to its opposite address \(!a = 2^d-1-a\).
- \(!mirr\) distributes the \(mirr\) operation over both
subvectors, so that each element in either subvector communicates with
its communication partner in the other subvector.
- \(\#!mirr\) says copy, bring, and adjoin that communication partner's
data to the local value.
- \((!min, !max)\) says: if a value is in the left subvector, keep the
minimum of the pair and discard the other; if it is in the right
subvector, keep the maximum and discard the other.
Hence \((!min,!max) : \#!mirr\) exchanges the values between
communication partners if the right one was smaller than the left one,
and leaves them unchanged if the left is equal to or smaller than the
right.
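In executable form, under the list encoding from before, the whole pair operation might be sketched as follows:

```python
def mirr_minmax(left, right):
    """(!min,!max) : #!mirr on a pair of related subvectors."""
    n = len(left)
    new_left = [min(left[i], right[n - 1 - i]) for i in range(n)]
    new_right = [max(right[i], left[n - 1 - i]) for i in range(n)]
    return new_left, new_right

# The |V|=2 step from the trace: ([6,7],[4,5]) becomes ([5,4],[7,6]).
assert mirr_minmax([6, 7], [4, 5]) == ([5, 4], [7, 6])
```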
Another bit of terminology. Given a finite set of values of size \(|V|\)
with \(|V|\) even, there must exist a median value \(med\) (not necessarily
unique), such that half the values are above or equal to \(med\), and the
rest of the values are below or equal to \(med\).
Now there are two lemmas that need to be proved to verify this recursion.
First we join two presorted subvectors L and R in a way which sorts
all the below-median values into L and all the above-median values
into R. If everything is placed into the correct half, then in a sense
we have sorted everything we have access to (the two subvectors being
worked on together) according to their highest bit, the dividing bit.
There may be chaos within L or within R, but from the dividing bit
perspective, everything in the two subvectors is in the correct
subvector.
Second, we comb L and R separately, in such a way that each becomes
internally fully sorted.
The proposition of the monosort algorithm is that \((!min,!max):\#!mirr\)
accomplishes the first, and that \(!nest\) accomplishes the second. If
so, then applying \((!min,!max):\#!mirr\) will subsort at the dividing bit,
inducing some chaos in the new L and R, while \(!nest\) will subsort from
there so that all the less-significant bits are in order. Once
subsorted between L and R and then within L and within R, we can conjoin L
and R to make a fully-sorted, double-length vector. This satisfies the
recursion requirements for the next level \(d+1\), and hence we can repeat
for \(d=0\) to \(d=D-1\) and we are done when the highest-level pair of
subvectors are between- and within-sorted, then joined, yielding the
final result.
So let's prove these lemmas.
First, we are given equal-length subvectors L and R which are
presorted. Then the count of values in L and R together is an even
number, so \(med\) exists.
Lemma One: \((!min, !max) : \#!mirr\) places all the values below \(med\) into the
left subvector and all values above \(med\) into the right subvector.
Proof: \(mirr\) associates the first (smallest) value in L with the
last (largest) in R, and steps value by value in opposite directions:
upward in L (since it is sorted increasingly), and downward in R (since
it is also sorted increasingly but we are stepping backwards from its
end). While the compared left value remains smaller than the right
value (which may be true for zero or more pairs), \((!min,!max):\#!mirr\)
does not swap them. Continuing up L and down R, at some point (which
might be at the very start) the sequences may cross, so that the
compared left value exceeds the compared right value (again true for
zero or more pairs -- these are all the logical possibilities). Set
\(med\) to any value between the closest values on the two sides of the
crossing point. After the crossing point, the values in L keep
increasing while their \(mirr\) communication partners in R keep
decreasing, so \((!min,!max):\#!mirr\) always swaps them; since L and R
were originally presorted, each successive pair lies only farther on
the wrong side of \(med\), triggering the swap by an even greater
difference. Therefore, before the crossing point, all the values in L
below \(med\) stay in L and those in R above \(med\) stay in R; after
the crossing point, conversely, all the values in L above \(med\) move
to R and those in R below \(med\) move to L. Hence after the \(mirr\)
swap applies all the way across L and R, the above-median half ends in
R and the below-median half ends in L. QED.
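To check the lemma against the trace table above, the \(|V|=4\) stage makes a handy test, using the mirr_minmax sketch from earlier:

```python
# L and R are each presorted; the mirr swap must send the four
# below-median values (0..3) to L and the four above-median values to R.
L, R = [4, 5, 6, 7], [0, 1, 2, 3]
L2, R2 = mirr_minmax(L, R)
assert (L2, R2) == ([3, 2, 1, 0], [7, 6, 5, 4])
assert max(L2) < min(R2)   # everything landed in the correct half
```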
Lemma Two: \(!nest\) sorts the \(\#!mirr\)-swapped outputs within L and within R
respectively, such that each is fully internally sorted.
Proof Suggestions: This one is a bit more work, and depends on the
previous median being the limit of rising or falling sequences of
values within the vector. Such limits at the turning points of
rise-fall or fall-rise value sequences let us infer a little about the
results of \((!min,!max):\#!corr\).
I persuade myself of it by drawing rising or falling lines representing
sequences of rising or falling values in L and R, between pairs of tall
brackets for the inputs and another pair of tall brackets for the
outputs. Between the brackets I place crossing axes: a conceptual
half-count vertical line and a conceptual \(med\) (median) horizontal
line. Then I label segments of the two inputs in such a way that I can
track what happens to each fragment of the inputs over on the output
side.
Recall that \(!\) means distribution, so \(!nest\) means separately apply \(nest\)
to the left and to the right subvectors. Hence \(!nest\) does not sort
between L and R, only within each.
\(nest\) is another Parallel Divide and Conquer algorithm called within \(sort\):
nest = PDC(\(d_{lr}, c_{lr}\), (!min,!max) : #!corr, id, atom?, id)
Is this familiar yet? It says: split the input into left and right
half-vectors, immediately communicate between corresponding elements of
the two subvectors, swapping if the left value is higher than the
right; then split them again, and do the same correspondent
communication and lower/higher swapping within each half-subvector.
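A sketch of the corr exchange, parallel to the mirr_minmax sketch above and again assuming plain lists:

```python
def corr_minmax(left, right):
    """(!min,!max) : #!corr on a pair of related subvectors."""
    new_left = [min(l, r) for l, r in zip(left, right)]
    new_right = [max(l, r) for l, r in zip(left, right)]
    return new_left, new_right

# The nest preadjust at |V|=1 from the trace: ([5],[4]) becomes ([4],[5]).
assert corr_minmax([5], [4]) == ([4], [5])
```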
The result of \((!min,!max):\#!corr\) is to cut up the fall-rise or
rise-fall pattern inherited from the previous stage, whether from
\((!min,!max):\#!mirr\) or from a previous \((!min,!max):\#!corr\), into
as much as a rise-fall-rise or a fall-rise-fall, depending on the new
median value and where it cuts the previous pattern. But the next round
divides the subvectors in half yet again, simplifying the possible
up-and-down patterning by half; and the median \(med\) inherits from
one level of \(\#!corr\) to the next a certain relationship between the
extremes of the previous round, which is enough to guarantee that at
each subdivision, all the high values in the subvector pair go right
and all the low values go left. This having been shown for any
more-significant bit, \(nest\) recurses to the next less-significant
bit, and the constraints are inherited again. In this way I show that
the full pre-adjustment PDC \(nest\), combing L and R from the high
bits of their own address spaces all the way down to the bottom, sorts
on each of those bits, ending with a fully sorted L, and likewise a
fully sorted R.
Hence we step \(d\) from the smallest atomic subvectors (\(d=0\)) to the
largest, half-\(V\) subvectors (\(d=D-1\)), doing \(mirr\) swaps to get
things sorted with respect to bit \(b_d\), and then combing and more
finely recombing with \(corr\) swaps, which with perfect tidiness push
the median-structured value sequences into the correct left and right
sides of increasingly divided subvectors, all the way down to atoms
again.
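At this point all the sketches above snap together into a runnable monotonic sort. The closure encoding is mine, but the structure is exactly the two PDC expressions that opened this section, built from the pdc, d_lr, c_lr, mirr_minmax, and corr_minmax sketches:

```python
def identity(left, right):
    return left, right

def atom(v):
    return len(v) == 1

# nest = PDC(d_lr, c_lr, (!min,!max) : #!corr, id, atom?, id)
nest = pdc(d_lr, c_lr, corr_minmax, identity, atom, lambda v: v)

# sort = PDC(d_lr, c_lr, id, !nest : (!min,!max) : #!mirr, atom?, id)
def sort_postadjust(left, right):
    left, right = mirr_minmax(left, right)   # subsort at the dividing bit
    return nest(left), nest(right)           # !nest: comb each half

monosort = pdc(d_lr, c_lr, identity, sort_postadjust, atom, lambda v: v)

# Reproduces the 8-item trace above.
assert monosort([7, 6, 5, 4, 3, 2, 1, 0]) == [0, 1, 2, 3, 4, 5, 6, 7]
```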
Note that the identical \(mirr\) swap operations would occur in a PDC
with a preadjustment function using \(d_{eo}, c_{eo}\) (so it operates
on atoms in pairs, to start), or in a PDC with a postadjustment
function using \(d_{lr}, c_{lr}\) (so it operates on atoms in pairs
only once all the recursive divisions are complete). George Mou has
pointed out that

PDC(\(d_{lr}, c_{lr}\), id, op, \(pred_b, f_b\)) = PDC(\(d_{eo}, c_{eo}\), op, id, \(pred_b, f_b\))

or to put it in words: postadjust on left-right division/combination
is the same as preadjust on even-odd division/combination. In both
cases, the adjust operation op works first on pairs of related
vectors of length 1, and works last on pairs of length \(|V|/2\). In
the LR case, the atomic vectors appear after completely dividing the
\(2^D\) values \(D\) times, and all the related subvector pairs are
processed after the recursive center is completed (postadjust), while
in the EO case the atomic vectors appear immediately upon the first
even-odd division, and the adjust process applies before the recursive
center is entered (preadjust). But the same operations occur on the
same data, moving from small to large, atomic to maximal, in either
approach. Hence if
\(monosort_{eo} = PDC(d_{eo}, c_{eo}, !nest : (!min,!max) : \#!mirr, id, atom?, id)\)
Then it follows that
\(monosort_{lr} = monosort_{eo}\)
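For concreteness, here is a sketch of the even-odd divide and combine (names mine, following the d_lr/c_lr sketch), together with a check of the fact used above: the very first even-odd division already lines up original adjacent pairs at corresponding positions, which is why the eo preadjust can work on atom-pairs immediately.

```python
def d_eo(v):
    """Divide a vector into its even-indexed and odd-indexed elements."""
    return v[0::2], v[1::2]

def c_eo(left, right):
    """Interleave two related subvectors back together."""
    out = []
    for l, r in zip(left, right):
        out.extend([l, r])
    return out

v = [7, 6, 5, 4, 3, 2, 1, 0]
evens, odds = d_eo(v)
# Position j of the two halves holds the original adjacent pair
# (v[2j], v[2j+1]) -- the atom-pairs the LR form reaches only after
# dividing all the way down.
assert list(zip(evens, odds)) == [(v[2*j], v[2*j+1]) for j in range(len(v)//2)]
assert c_eo(evens, odds) == v   # c_eo inverts d_eo
```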
George reports that after presenting monotonic sort to his Yale peers
and professors, Alan Perlis pulled him aside to ask him whether
another version of a second-order PDC could be found with an
alternative nesting relationship, such as \(mirr\) inside \(corr\), etc.
Given the Optimal Mappings paper, it seems clear to me that any permutation of bits
could be applied in a monotonic sort algorithm. The dividing bit \(d\)
can range, instead of from 0 to \(D-1\), from the 0'th to the \(D-1\)'th permuted bit,
so long as permuted divide and permuted combine functions are also available.
Just as in hypercube communication, it seems natural that one should
be able to divide on any bit -- any dimension -- and recombine
later on that bit in the order specified by the given permutation of
bits. If we consider the unsorted values according to their
final-sorted ordinal address values, and consider any bit the lowest
and any bit the highest according to some fixed permutation \(p\), then
\(d_{p}, c_{p}\) will navigate in the permuted order, subdividing the
binary hypercube corresponding to all the addresses in the input data
vector. This requires that we treat values or addresses as something
more akin to linguistic elements than geometrical ones, and that
instead of making smaller and larger groupings, which depends on the
non-permutation of the address bits, we abandon our intuitions about
what belongs together, and let the algorithm follow correct reasoning.
I'll bet a dollar that the values will shuffle and percolate across
surprising distances but in the end will pop into final order no
matter the permutation. To that end the terminology above was intended
to enable this generalization.