My own confusions from Page 2 of the Compress-and-Conquer paper are discussed below.
- "Compression": This word has a particular meaning in information
theory, as in "data compression" which means taking a file of a given
size, compressing it to a different file with a smaller size by
removing redundancy, and also being able to restore it to its original
form. That is NOT the meaning to be understood here.
Instead, "Compress" here means "shrink" in a different way, and maybe
two or three ways.
- First, shrink the problem by simply dividing it into |P|
smaller problems, so that each smaller problem can be handled by a
separate computer or core. From the perspective of each core, it is
a shrunken or compressed problem. Note that each smaller problem must
be solved independently of the others, which is what allows each core
to do its work in parallel. So compression implies separation into
independently solvable, separate subproblems.
- Second, once the subproblems are solved by the separate cores,
there is a new problem: combine their solutions (sums, last values,
results of some kind) into a vastly smaller, less populated set of
answers. This new problem is tiny: scan through these few results and
keep a running sum over them. If there are 10 or 20 cores, then there
are 10 or 20 results, and the scan-and-running-sum task takes 10 or 20
steps, instead of the 2^30 or whatever might be the size of even the
smaller problems handled by each core. So this new global task is also
a vastly shrunken task, and you could imagine that "compression" means
reducing this important part of the global problem to a much, much
smaller problem, which can then be handled in the so-called "global
phase". So maybe that's what compression means here; a small sketch of
this follows below.
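To make that concrete for myself, here is a minimal Haskell sketch of meaning #2 (my own toy code, not the paper's notation; segments are plain lists and the per-core job is just a sum):

-- Each "core" reduces its segment to a single number; the global
-- phase then scans only those |P| numbers, a vastly smaller problem.
localSums :: [[Int]] -> [Int]
localSums segments = map sum segments

globalScan :: [Int] -> [Int]
globalScan = scanl1 (+)

-- e.g. globalScan (localSums [[1,1,1],[1,1,1],[1,1,1]]) == [3,6,9]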
- Another potential meaning is maybe too subtle. The problem with
computer science is the Von Neumann Bottleneck, which means everything
is one step at a time: every line of code is one step at a time (well,
not always; there is recursion, after all, but that's also pretty
linear), and every thought about how to program anything, which we
call functional analysis, involves taking the problem and breaking it
into smaller and smaller pieces until finally every piece can be done
by one-step-at-a-time operations. This is a Very Big Problem, not just
in computer science but in human thinking, because it forces us to be
very stupid and never to understand Bigger Picture thinking. So in my
own opinion, we should reconsider our small-mindedness, abandon the
incremental division of all things into specified onesies, and try to
take a global picture of things. Like maybe start at the top, or start
by thinking inside the problem at an arbitrary level of a divacon
graph, where we are dealing with vectors of unknown size, and process
them all at once, conceptually, by saying "everyone talk to everyone
and do this same simple thing," and then let that lead to the next
level up or down the computation graph. This kind of thinking, well,
is it bigger thinking, or is it compressed, smaller thinking? We are
all so frightened of thinking of a big picture that we make it even
bigger and bigger by turning it into a giant pile of barely manageable
brick-sized pieces. But instead maybe we can compress this big picture
into any single (small!) level of a many-level problem and just
conceive how one level might work: say, let two vectors talk to each
other by corresponding indexes, one element to its corresponding
element (which would be So Simple, while on the other hand the
vectors' lengths could be 2^N with ANY N, say 5,000,000,000, and you
have just specified the whole thing happening in lockstep in one clock
step). Or let each element talk to its mirror-element, here the
beginning talking to there the end, and vice versa.
Now we have different ways of having everything communicate with
everything else all at once, and then maybe do some very simple local
operation, like pick the bigger one, or add to the one on the right,
or some easy job; and in this compressed vision of vast problems we
can easily just hang it on the scaffold and make 2, 4, 8, 16, 32, 2^N
processors all do the same thing, and because of the
divide-and-conquer scaffold, there is nothing unspecified, nor even
anything complicated, about it.
Actually the trick has been to not even see the big problem as a big
problem any more, but to compress your vision to the small aspects
like patterns of communication between arrays or local operations
within a single processor, and then when we zoom out and see the whole
scaffold running at the same time, we are amazed to discover that we
have solved a problem at arbitrary scale, and arbitrary means any
scale including very very very very huge scales. So in a way this is
also what's going on in Compress-and-Conquer, which is to conceptually
compress the work of the problem into simple small parts, but not
excluding the communication of results and then their combination into
higher-level but now also small problems. Having this high-level view
makes the problems very small, in a certain particular way, and that
is a potential meaning of "compression" in "Compress-and-Conquer".
But that is just me; no doubt George will consider that silly, as he
kind of thinks the world is uneducable. But are you uneducable, or
will you try to find these angles of leverage, these special
encompassing perspectives where the whole field lines up and you can
farm the whole field in a few simple steps? The point is, those
perspectives are not huge and vast; they are small, compressed
perspectives, but they have Insight in them, and they Scale. So
that's the implicit suggestion here.
Okay?
After visiting George today and discussing this, I learned that he
means #2. Compression actually is the co function, which has type
co :: S b → S c. co's input is a collection of b's, which are
the outputs of the regular sequential function fs, all of them within
a single processor. co's output is whatever message or information
packet is required to be given to the global phase, such as a sum or
summary or unit section, or something. Each processor runs co on its
fs-processed contents (a collection of b's) and sends a
message (just one c for each processor, but all together a collection
or Set of c's: S c) to the central or global processor P0
for the next phase. This next phase will calculate the whole
top-level program in extremely short order, handling only |P| packets
to make the top-level solution view, and that is a vastly smaller
problem to solve than the whole original problem, because it only uses
these summaries or messages, |P| of them, to create that big-picture
view.
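For the scan example that follows below, here is my guess at a concrete co, with Haskell lists standing in for the paper's S (the name coScan is mine):

-- Each processor's fs output is its local running sum (a collection
-- of b's); the packet sent toward P0 is just the last of those, the
-- segment total, wrapped as a one-element collection of c's.
coScan :: [Int] -> [Int]
coScan bs = [last bs]

-- e.g. coScan [1,2,3] == [3]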
- The algorithm description is, to me, incomprehensible without
reference to an actual case in which it could be used. Let's try the
scan problem, which is indeed quite trivial, but it lets us visualize
and understand what the paper is talking about in a simple and clear
case.
scan [1 1 1 1 1 1 1 1 1] is [1 2 3 4 5 6 7 8 9], which is a running
sum moving left to right over the input vector. This could be divided
into three problems, each of the form scan [1 1 1], which gives a sum
for the subproblem (3) and a vector of partial results [1 2 3].
Dividing it into three, and summarizing each part down to its sum, is
what is called "compression".
Next there is a so-called "global" phase, which runs the same function
over the subproblem results (in this example, the sums): scan [3 3 3]
which yields [3 6 9].
Finally there is a so-called expansion phase, in which the "global"
phase results are taken in, once again, by the separate cores; this
models communication between the cores. In particular the global
result vector [3 6 9] is shifted right by one (resulting in [0 3 6]),
and its members are supplied respectively to the several cores, each
of which can then apply this new input to its partial results, so
that:
0 + [ 1 2 3 ] = [1 2 3]
3 + [ 1 2 3 ] = [4 5 6]
6 + [ 1 2 3 ] = [7 8 9]
and the final results are appended to produce the final result vector
[1 2 3 4 5 6 7 8 9].
Notice that the last output [9] did NOT require sequential evaluation
of elements 1 through 8. The problem was subdivided, the last bit was
handled sequentially, then the summaries were passed around (between
sending and receiving processors), and then the last core did one more
little computation. For huge data vectors this will produce a huge
time savings, mostly by reducing the communication event count and the
sequential dependencies. There is less waiting around, because everyone
can do some of their work independently, then just pass around some
local results (indeed an unboundedly small fraction of the size of a
segment), process them for a big picture, and then apply the big
picture (once again, in very few messages) back to the different
cores, each of which has a small job to do again. That's the idea.
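As a sanity check of my reading, here is that little example written out as runnable Haskell; the phase names and helper functions are mine, not the paper's:

-- The nine-ones scan worked end to end, as I understand the phases.
compressPhase :: [[Int]] -> ([[Int]], [Int])
compressPhase segs = (map (scanl1 (+)) segs,  -- local running sums
                      map sum segs)           -- per-core summaries

globalPhase :: [Int] -> [Int]
globalPhase sums = 0 : init (scanl1 (+) sums) -- scan, then shift right

expandPhase :: [Int] -> [[Int]] -> [Int]
expandPhase offsets partials =
  concat (zipWith (\o p -> map (+ o) p) offsets partials)

-- let (partials, sums) = compressPhase [[1,1,1],[1,1,1],[1,1,1]]
-- expandPhase (globalPhase sums) partials == [1,2,3,4,5,6,7,8,9]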
- '::': This double-colon symbol, which apparently everyone already
knows (except me), is a type signature marker: A :: B says A is of type B.
- '→': The right-arrow is more type signature decoration,
and it is essential to understand it in sequences of more than one.
A :: B → C → D says A is a certain type of thing,
namely the kind of thing that takes as input something of type B, and then
takes as input something else of type C, and then outputs something of type D.
For example a function gives(Tom,Mary,Gift) { ... return True; } would have the type signature
gives :: Person → Person → Thing → Boolean
Essential is the sequencing of arguments: they have to come in one at
a time, and each one merely transforms the function into a slightly
more decorated function rather than actually doing anything, until the
last argument arrives. Not at all helpful is this Haskellian
insistence that the intake of arguments be written identically to the
emission of outputs. Confusing and, it seems to me, pointless. But
what do I know, I'm just a phonetician.
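To see the one-at-a-time intake concretely, here is a tiny Haskell version of that gives example (the types and names are made up by me):

-- Currying in miniature: each argument only specializes the function a
-- little further; nothing "happens" until the last argument arrives.
data Person = Tom | Mary deriving Show
data Thing  = Gift       deriving Show

gives :: Person -> Person -> Thing -> Bool
gives _giver _receiver _thing = True

-- gives Tom           :: Person -> Thing -> Bool   (still a function)
-- gives Tom Mary      :: Thing -> Bool             (still a function)
-- gives Tom Mary Gift :: Bool                      (finally a value)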
Now you are ready to read
cc :: (...) → (...) → ... S a → S b
This means cc is a function with the type signature of taking
one function as an argument, then another, then another, and so on,
finally taking a collection or Set of objects S a, and outputting a
different Set of objects S b.
Secretly hidden in this expression is the idea that cc is a
higher-order function: it is a function that takes functions as
arguments and
returns another function. The returned function takes S a as its input
and gives S b as its output.
Sorry that was not obvious to me.
But now, there you go!
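A toy higher-order function with the same outline, if it helps (my own example, nothing from the paper):

-- pipeline takes two functions as arguments; what it returns is itself
-- a function from one collection to another, just like cc.
pipeline :: (a -> b) -> (b -> c) -> ([a] -> [c])
pipeline f g = map (g . f)

-- doubledThenShown = pipeline (*2) show
-- doubledThenShown [1,2,3] == ["2","4","6"]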
- Next we have an inscrutable string
cc d c co xp comg comh fs = ...
This is the start of the definition, but it still leans on the type
signature: each of the names d .. fs is implicitly asserted to be a
function of the corresponding type in the above type signature. Yes,
you just have to know that. But it's a sweet (I mean concise) way of
making a lot of type declarations; for example, now we know that fs
takes in a collection of a's and outputs a collection of b's, because
that's the 7th function type among the functions that cc accepts as
arguments.
- Now we start to get to the root of confusion.
cc d ... s = # this is a comment, #-initial
let seg = d s # this says just remember that we have defined seg as d s
# noting that d has type (∀ a . S a → [S a])
# and that s must have type S a (a set of a's)
# and that they fit together taking all the a's in the S a that is s
# and making a vector [S a] out of them.
# this means, take the inputs, segment them into a's, and make
# a vector of a's, or perhaps a vector of vectors of a's. Don't know yet.
But seg being d s means it is the OUTPUT of d applied to s, so it is
actually the vector, or vector of vectors, [S a]. This does make
sense, because our approach is to segment the huge input problem into
segments, each being a smaller problem, and then we have a vector of
these smaller problems or segments. What exactly is [S a]? I still
don't know.
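My guess at what d amounts to in the scan case, with Haskell lists standing in for S a and [S a] (the helper dChunks and the choice of contiguous chunks are mine):

-- Cut the one big input into |P| contiguous segments, giving a vector
-- of smaller problems, one per processor.
dChunks :: Int -> [a] -> [[a]]
dChunks p xs = go xs
  where
    n = max 1 (length xs `div` p)
    go [] = []
    go ys = let (h, t) = splitAt n ys in h : go t

-- e.g. dChunks 3 [1,1,1,1,1,1,1,1,1] == [[1,1,1],[1,1,1],[1,1,1]]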
- Still at the root of the confusion: let seg = d s is just the
first of several declarations. The function also defines a bunch of
other terms:
pre = map (co . fs) seg # you're supposed to know what map means already.
# This says separately apply fs followed by co
# to each of the segments in seg.
# that would be the compression phase: separately
# doing fs on each core, then summarizing with co.
# co in the scan case just takes the last value of each core's scanned segment.
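Concretely, for the scan example, my reading of that line is (the Scan-suffixed names are mine):

-- fs is the ordinary local scan, co keeps only the segment total, and
-- map runs the pair of them on every segment independently.
fsScan :: [Int] -> [Int]
fsScan = scanl1 (+)

preScan :: [[Int]] -> [[Int]]
preScan segs = map (coScan . fsScan) segs
  where coScan bs = [last bs]

-- e.g. preScan [[1,1,1],[1,1,1],[1,1,1]] == [[3],[3],[3]]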
Okay. And then:
- core = (d.comh.fs.comg.c) pre # now that we are completely familiar with pre,
# we can immediately use it to define core, great.
# Now core is composition of a bunch of functions
# applied to pre. Again pre is the summaries,
# so first we apply c then comg then fs then ... etc.
Let me put words to this cluster. We have all the summaries for each
of the divided subproblems; that's pre. We use c to combine them into
some kind of agglomeration, like a pair of related results as in
divacon, or perhaps just a |P|-length vector of intermediate results.
Then we apply comg, the pre-fs communication function, which associates
with each item in the agglomeration information taken from some other
item in the agglomeration. To sort into increasing order, for
example, comg might communicate the contents of each item between
pairs in the agglomeration, so that the fs function can then choose
the smaller for the left-hand item and the larger for the right-hand
item. comg does communication BEFORE fs applies.
Okay, then fs applies; this is the normal sequential function that
also operated within each of the cores on its subset of the total
input data vector, but now it is operating on the summaries. Instead
of scan [1 1 1] yielding [1 2 3], fs now carries out, in our example,
scan [3 3 3] (those being summaries), yielding [3 6 9].
Then there is a post-operation communication that can apply, in this
case a shift-right where every position takes, in lockstep, from its
left neighbor, or zero if there is none.
Finally there is a division operation again, so that these adjusted
elements are not stuck inside a vector but can be handed to the
different processors and their data separately. Whew.
So in short core is not the summaries (that's pre, like [3 3 3]), but
the reprocessed summaries like [0 3 6]. Great.
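Put in code for the scan case, here is my reconstruction of what that composition does, one step at a time (I keep each packet wrapped as a one-element list, so pre == [[3],[3],[3]]; all the concrete choices below are my guesses):

-- The global phase as a composition, read right to left.
coreScan :: [[Int]] -> [[Int]]
coreScan = dBack . comhShift . fsScan . comgId . cConcat
  where
    cConcat      = concat          -- c:    [[3],[3],[3]] -> [3,3,3]
    comgId       = id              -- comg: scan needs no pre-fs communication
    fsScan       = scanl1 (+)      -- fs:   [3,3,3] -> [3,6,9]
    comhShift xs = 0 : init xs     -- comh: shift right -> [0,3,6]
    dBack        = map (:[])       -- d:    one packet per core -> [[0],[3],[6]]

-- e.g. coreScan [[3],[3],[3]] == [[0],[3],[6]]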
- We are not out of the trees, but maybe there is light.
post = map (fs . xp) (zip core seg)
in c post
post is defined as applying fs.xp separately to each of the results of
zipping core and seg together.
My brain is about to melt. Zip is nowhere defined in this paper, but
used 11 times.
On the other hand zip is defined in George's thesis, p61-62, not
without drama. Supposing you define a data structure of M, say 3,
fields, and you have M=3 arrays of length N, then zip can construct a
single array also of length N, but now each element of the zipped
array is one of your structures containing M=3 fields taken from the
corresponding array elements. That is what makes sense to me, and
that is what he says. But then he throws the cards in the air:
For example with B as above ("an array of the same shape"), we have
B[ī] = struct (A0[ī], ..., Am-1[ī]);
but ī is an array of indices, hence each Ak[ī] is an array of A's, and
hence struct (A0[ī], ..., Am-1[ī]) is a struct of arrays, not an array
of structs. Hence my mind has indeed melted down entirely. Will
someone explain it to me?
Tell you what, I'm just going to pretend I never read that, and go
with my intuition.
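For what it's worth, Haskell's ordinary zip behaves exactly like that array-of-structs intuition (this is the standard library's zip, not the thesis's):

-- Two same-length arrays become one array of pairs, element by element.
zipExample :: [(Int, [Int])]
zipExample = zip [0, 3, 6] [[1,1,1], [1,1,1], [1,1,1]]
-- == [(0,[1,1,1]), (3,[1,1,1]), (6,[1,1,1])]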
- So let's go back to the definition where it says (zip core seg).
Now core is the collection of reprocessed summaries as defined and
discussed previously, and seg is the original collection of
segments that were fed separately to each of the processors in the
"pre = map ... seg" step.
So then, using my intuition for zip, (zip core seg) is a collection
of 2-element structs, one element from core and one from seg. Great:
we have a reprocessed summary, and we have a whole pile of original
data that this processor has seen before.
There is more to do, but I'm getting zombified. We have to check the
type of the xp function (a function taking a set of d's and a set of
a's and returning a set of a's). Is that meaningful to you? Not to me,
although our inputs are indeed pairs, one from core (are those the
d's?) and one from seg (those are surely the (S a)'s). Expand, or xp,
occurs many times in this paper but is never clarified or exemplified
unambiguously.
"Expansion phase: c . map (fs . xp) . zip" is different from "expanded
by function xp". I *think* xp takes the struct with the reprocessed
summary and the data, prepends one to the other, and pushes the
1-longer data array out to each of the cores. That would yield "6+[1
1 1] in the 3rd core in our small scan example, which by a small
modification of fs applied to this would yield [7 8 9], the desired
outcome. Or perhaps xp (6 , [1 1 1]) obviates modifying fs by
producing [7 1 1], which scans to [7 8 9] without change. I'll take
that as evident, but the transfer of this data to the separate cores,
and the location in which the xp function is carried out, is not
specified by all this math, we engineers have to decide how to
implement what the abstract layer demands. Me, I'd keep the seg
arrays in the cores for later access, have xp both copy the
reprocessed results to each core AND add it to the first element of
that core's seg array, and then it works. But that's just me.
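Both readings of xp, as tiny Haskell sketches; the names and bodies are mine, since the paper does not pin this down:

-- Reading 1: fold the offset into the first data element, so the
-- unmodified local scan produces the final answer directly.
xpIntoFirst :: (Int, [Int]) -> [Int]
xpIntoFirst (offset, xs) = zipWith (+) (offset : repeat 0) xs
-- e.g. xpIntoFirst (6, [1,1,1]) == [7,1,1]
--      scanl1 (+) [7,1,1]       == [7,8,9]

-- Reading 2: prepend the offset and let a slightly modified fs drop it.
xpPrepend :: (Int, [Int]) -> [Int]
xpPrepend (offset, xs) = offset : xs
-- e.g. tail (scanl1 (+) (xpPrepend (6, [1,1,1]))) == [7,8,9]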
- Breaking into the light, we are done if we can interpret "in c post".
in c post
in means that the expression "c post" is evaluated using the
definitions above it.
c is defined as the combination function, which presumably concatenates
results from all the cores in order to make a final output.
post is defined above as map (fs . xp) (zip core seg), so it
is the set of outputs of separately applying fs . xp to the pairs of
reprocessed results from core and data segments from seg. But then
core also needed to be defined, which it is in the global-phase
discussion, and of course that in turn includes the initial division,
the sequential processing within each processor, and the summary with
the co function. The whole hierarchy of prerequisites unrolls as c and
post are expanded into their definitions, and their definitions are
further expanded into theirs, all the layers down, until we get this
whole thing applied to the input collection of data s.
Did I get it right? It works for scan the way I think it should.
Yes.
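And to close the loop, here is the whole cc skeleton written out as runnable Haskell, specialized to Int lists so it can be checked against the scan example. The skeleton's shape follows the paper's definition as I read it; the concrete pieces plugged in at the bottom (dChunks, the lambdas, the choice of id for comg) are my own guesses, not anything the paper states:

-- cc with Haskell lists standing in for S, monomorphic in Int so the
-- two uses of d and fs type-check without fancier machinery.
ccScan :: ([Int] -> [[Int]])          -- d:    divide a collection into segments
       -> ([[Int]] -> [Int])          -- c:    combine segments into one collection
       -> ([Int] -> [Int])            -- co:   compress one core's output to its packet
       -> (([Int], [Int]) -> [Int])   -- xp:   expand (packet, segment) to a new segment
       -> ([Int] -> [Int])            -- comg: communication before the global fs
       -> ([Int] -> [Int])            -- comh: communication after the global fs
       -> ([Int] -> [Int])            -- fs:   the ordinary sequential function
       -> [Int] -> [Int]
ccScan d c co xp comg comh fs s =
  let seg  = d s                               -- segment the input
      pre  = map (co . fs) seg                 -- compression: per-core results, summarized
      core = (d . comh . fs . comg . c) pre    -- global phase, on |P| packets only
      post = map (fs . xp) (zip core seg)      -- expansion: push packets back out
  in c post

-- Divide into |p| roughly equal contiguous chunks (same helper as before).
dChunks :: Int -> [Int] -> [[Int]]
dChunks p xs = go xs
  where
    n = max 1 (length xs `div` p)
    go [] = []
    go ys = let (h, t) = splitAt n ys in h : go t

-- The scan instantiation: three cores, running sums.
scanCC :: [Int] -> [Int]
scanCC = ccScan (dChunks 3)                                       -- d
                concat                                            -- c
                (\bs -> [last bs])                                -- co: keep the segment total
                (\(off, xs) -> zipWith (+) (off ++ repeat 0) xs)  -- xp: [6]+[1,1,1] -> [7,1,1]
                id                                                -- comg: nothing to say before fs
                (\xs -> 0 : init xs)                              -- comh: shift right, zero-fill
                (scanl1 (+))                                      -- fs

-- scanCC (replicate 9 1) == [1,2,3,4,5,6,7,8,9]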