Compress-and-Conquer

Study Notes

On Mou on Divide and Conquer

My own confusions from Page 2 of the Compress-and-Conquer paper are discussed below.

  • "Compression": This word has a particular meaning in information theory, as in "data compression" which means taking a file of a given size, compressing it to a different file with a smaller size by removing redundancy, and also being able to restore it to its original form. That is NOT the meaning to be understood here.

    Instead, "Compress" here means "shrink" in a different way, and maybe two or three ways.

    • First, shrink the problem by simply dividing it into |P| smaller problems, so that each smaller problem can be handled by a separate computer or core. From the perspective of each core, it is a shrunken or compressed problem. Note that each smaller problem must be solved independently of the others, thus allowing each core to do its work in parallel with the rest. So compression implies separation into independently solvable, separate subproblems.

    • Second, once the subproblems are solved by the separate cores, there is a new problem: combine their solutions (sums, last values, whatever the results are) into a vastly smaller set of answers. This new problem is a very tiny problem: scan through these few results and make a running sum over them. If there are 10 or 20 cores, then there are 10 or 20 results, and your scan-and-running-sum task is 10 or 20 steps, instead of 2^30 or whatever might be the size of even the smaller problems handled by each core. So this new global task is also a vastly shrunken task, and you could imagine "compression" means reducing this important part of the global problem to a much, much smaller problem, which can then be handled in the so-called "global phase". So maybe that's what compression means here.

    • Another potential meaning is maybe too subtle. The problem with computer science is the Von Neumann Bottleneck, which means everything is one-step-at-a-time; every line of code is one-step-at-a-time (well, not always, there is recursion after all, but that's also pretty linear), and every thought about how to program anything, which is called functional analysis, involves taking the problem and breaking it into smaller and smaller pieces so that finally you can do every piece by one-step-at-a-time operations. This is a Very Big Problem, not just in computer science but in human thinking, because it forces us to be very stupid, and never to understand Bigger Picture thinking.

      So in my own opinion, we should reconsider our small-mindedness, abandon the incremental division of all things into specified onesies, and try to take a global picture of things. Like maybe start at the top, or start by thinking inside it at an arbitrary level of a Divacon graph, where we are dealing with vectors of unknown size, and process them all at once, conceptually, by saying everyone talk to everyone and do this simple same thing, and then make that lead to the next level up or down the computation graph. This kind of thinking, well, is it bigger thinking, or is it compressed, smaller thinking? We are all so frightened of thinking of a big picture that we have to make it even bigger and even bigger by turning it into a giant pile of barely-manageable brick-sized pieces. But instead maybe we can compress this big picture into any single (small!) level of a many-levels problem and just conceive how one level might work: say, let the two vectors talk to each other by corresponding indexes, one element to its corresponding element (which would be So Simple; on the other hand the vectors' lengths could be 2^N with ANY N (say 5,000,000,000), and you just specified the whole thing all happening in lockstep in one clock step). Or to its mirror element: the beginning here talking to the end there, and vice versa.

      Now we have different ways of having everything communicate with everything else all at once, and then maybe do some local operation, very simple, like pick the bigger one, or add to the one on the right, or some easy job; but in this compressed vision of vast problems we can easily just hang it on the scaffold and make 2, 4, 8, 16, 32, ... 2^N processors all do the same thing, and because of the divide-and-conquer scaffold, there is nothing unspecified, nor even anything complicated, about it.

      Actually the trick has been to not even see the big problem as a big problem any more, but to compress your vision to the small aspects like patterns of communication between arrays or local operations within a single processor, and then when we zoom out and see the whole scaffold running at the same time, we are amazed to discover that we have solved a problem at arbitrary scale, and arbitrary means any scale including very very very very huge scales. So in a way this is also what's going on in Compress-and-Conquer, which is to conceptually compress the work of the problem into simple small parts, but not excluding the communication of results and then their combination into higher-level but now also small problems. Having this high-level view makes the problems very small, in a certain particular way, and that is a potential meaning of "compression" in "Compress-and-Conquer".

      But that is just me; no doubt George will consider that to be silly, as he kind of thinks the world is uneducable. But are you uneducable, or will you try to find these angles of leverage, these special encompassing perspectives where the whole field lines up, and you can farm the whole field in a few simple steps? The point is, those perspectives are not huge and vast, they are small, compressed perspectives, but they have Insight in them, and they Scale. So that's the implicit suggestion here.

      Okay?

      After visiting George today and discussing this: he means the second sense. Compression actually is the co function, which has the type signature co :: S b → S c. co's input is a collection of b's, which are the outputs of the regular sequential function fs, all of them in a single processor. co's output is whatever message or information packet is required to be given to the global phase, such as a sum or summary or unit section, or something. Each processor runs co on its fs-processed contents (a collection of b's) and sends a message (just one c for each processor, but all together a collection or Set of c's: S c) to the central or global processor P0 for the next phase. This next phase will calculate the whole top-level program in extremely short order, handling only |P| packets to make the top-level solution view, and that is a vastly smaller problem to solve than the whole original problem, because it only uses these summaries or messages, |P| of them, to create that big-picture view.
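
      In Haskell-ish terms for the scan example, here is my own hedged guess at what fs and co could look like (the concrete definitions are mine, chosen to match the example below, not quoted from the paper):

        -- Hypothetical per-processor functions for scan:
        fs :: [Int] -> [Int]
        fs = scanl1 (+)        -- the ordinary sequential running sum, inside one processor

        co :: [Int] -> [Int]
        co xs = [last xs]      -- "compression": the packet is just this processor's final sum

        -- One processor: (co . fs) [1,1,1]  ==>  [3]
        -- All |P| processors together contribute the packets [[3],[3],[3]].
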
  • The algorithm description to me is incomprehensible without reference to an actual case in which it could be used. Let's try the scan problem, which is indeed quite trivial but it lets us visualize and understand what the paper is talking about in a simple and clear case.

    scan [1 1 1 1 1 1 1 1 1] is [1 2 3 4 5 6 7 8 9], a running sum moving left to right over the input vector. This could be divided into three problems, each of the form scan [1 1 1], which gives a sum for the subproblem (3) and a vector of partial results [1 2 3]. Dividing it into three, and summarizing each piece with its sum, is the "compression" phase.

    Next there is a so-called "global" phase, which runs the same function over the subproblem results (in this example, the sums): scan [3 3 3] which yields [3 6 9].

    Finally there is a so-called expansion phase, in which the "global" phase results are taken in, once again, by the separate cores; this models communication between the cores. In particular the global result vector [3 6 9] is shifted right by one (resulting in [0 3 6]), and its members are supplied respectively to the several cores, each of which can then apply its new input to all of its partial results, so that:

    0 + [ 1 2 3 ] = [1 2 3]
    3 + [ 1 2 3 ] = [4 5 6]
    6 + [ 1 2 3 ] = [7 8 9]
    

    and the per-core results are appended to produce the final result vector [1 2 3 4 5 6 7 8 9].

    Notice that the last output [9] did NOT require sequential evaluation of 1 through 8. The problem was subdivided, the last segment was handled sequentially, then the summaries were passed around (between sending and receiving processors), and then the last core did a little more computation.

    For huge data vectors this will produce a huge time savings, mostly because it reduces the communication event count and the sequential dependencies. There is less waiting around, because everyone can do some of their work independently, then just pass around some local results (indeed an unboundedly small fraction of the size of a segment), process them for a big picture, and then apply the big picture (once again, in very few messages) back to the different cores, each of which has a small job to do again. That's the idea.
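
    Here is the same walkthrough as a few lines of Haskell, just to make the three phases concrete. This is a sketch of this one example only, with my own names (segs, locals, sums, offsets), not the paper's general algorithm:

      segs    = [[1,1,1], [1,1,1], [1,1,1]]

      -- Local ("compression") phase: each core scans its segment and reports its final sum.
      locals  = map (scanl1 (+)) segs            -- [[1,2,3],[1,2,3],[1,2,3]]
      sums    = map last locals                  -- [3,3,3]

      -- Global phase: scan the per-core sums, then shift right with a zero.
      offsets = 0 : init (scanl1 (+) sums)       -- [0,3,6]

      -- Expansion phase: each core adds its offset to its own partial results.
      answer  = concat (zipWith (\o xs -> map (o +) xs) offsets locals)
                                                 -- [1,2,3,4,5,6,7,8,9]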

  • '::': This double-colon symbol, which apparently everyone already knows (except me), is a type signature marker: A :: B says A is of type B.

  • '→': The right-arrow is more type signature decoration, and it is essential to understand it in sequences of more than one. A :: B → C → D says A is a certain type of thing, namely the kind of thing that takes as input something of type B, and then takes as input something else of type C, and then outputs something of type D. For example a function gives(Tom,Mary,Gift) { ... return True; } would have the type signature

    gives :: Person → Person → Thing → Boolean

    Essential is the sequencing of arguments: they have to come in one at a time, and furthermore each one merely transforms the function into a slightly more decorated function, one at a time, rather than doing anything. Not at all helpful is this Haskellian insistence that the intake of arguments is to be written identically to the emission of outputs. Confusing and, it seems to me, pointless. But what do I know, I'm just a phonetician. (A small sketch of this one-at-a-time intake appears at the end of this bullet.)

    Now you are ready to read

    cc :: (...) → (...) → ... S a → S b

    This means cc is a function and it has the type signature of taking one function as an argument, then another, then another, ..., finally taking a collection or Set of objects S a, and outputting a different Set of objects S b.

    Secretly hidden in this expression is the idea that cc is a higher order function: it is a function that takes functions as arguments and returns another function. The returned function takes S a as its input and gives S b as its output.

    Sorry that was not obvious to me.

    But now, there you go!
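
    A small Haskell illustration of the one-argument-at-a-time idea (the Person and Thing types and the toy function body are made up for illustration, not from the paper):

      type Person = String
      type Thing  = String

      gives :: Person -> Person -> Thing -> Bool
      gives giver receiver thing = giver /= receiver && not (null thing)   -- toy body, just so it runs

      -- Each argument merely "decorates" the function into a more specialized one:
      tomGives     = gives "Tom"          -- :: Person -> Thing -> Bool
      tomGivesMary = tomGives "Mary"      -- :: Thing  -> Bool
      answer       = tomGivesMary "Gift"  -- :: Bool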

  • Next we have an inscrutable string

    cc d c co xp comg comh fs = ...

    This is still just the type signature at work: each of the names d .. fs is implicitly asserted to be a function of the corresponding type in the above signature. Yes, you just have to know that. But it's a sweet (I mean concise) way of making a lot of type declarations; for example, now we know that fs takes in a collection of a's and outputs a collection of b's, because that's the 7th function type among the functions that cc accepts as arguments.
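
    For comparison, here is a two-argument toy in the same pattern, where the names in the defining equation line up one-for-one with the arrows in the signature (the function pipe2 is made up for illustration):

      pipe2 :: (a -> b) -> (b -> c) -> a -> c
      pipe2 f g x = g (f x)    -- f has the 1st argument's type, g the 2nd, x the 3rd

      -- pipe2 (map (*2)) sum [1,2,3]  ==>  12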

  • Now we start to get to the root of confusion.

    cc d ... s =      # this is a comment, #-initial
    let seg = d s     # this says just remember that we have defined seg as d s
                      # noting that d has type (∀ a . S a → [S a])
                      # and that s must have type S a (a set of a's)
                      # and that they fit together, taking all the a's in the S a that is s
                      # and making a vector [S a] out of them.
                      # this means: take the inputs, segment them into a's, and make
                      # a vector of a's, or perhaps a vector of vectors of a's. Don't know yet.

    But seg being d s means it is the OUTPUT of d applied to s, so it is actually the vector or vector of vectors [S a]. This does make sense, because our approach is to segment the huge input problem into segments each being a smaller problem; and we have a vector of these smaller problems or segments. What is [S a]? I still don't know.
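
    As a hedged guess at what d might look like in the scan example (the chunking helper and the explicit processor count p are my own additions; the paper's d has type (∀ a . S a → [S a]) with no extra argument):

      -- Hypothetical division function: split a list into p equal segments,
      -- assuming p divides the length evenly.
      d :: Int -> [a] -> [[a]]
      d p xs = chunk (length xs `div` p) xs
        where
          chunk _ [] = []
          chunk n ys = take n ys : chunk n (drop n ys)

      -- d 3 [1,1,1,1,1,1,1,1,1]  ==>  [[1,1,1],[1,1,1],[1,1,1]]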

  • Continuing: let seg = d s is just the first of several declarations. The function also defines a bunch of other terms:

    pre = map (co . fs) seg    # you're supposed to know what map means already.
                               # This says separately apply fs followed by co
                               # to each of the segments in seg.
                               # That would be the compression phase: separately
                               # doing fs on each core, then summarizing with co.
                               # co in the scan case just takes the last value of each segment.

    Okay. And then:

  • core = (d . comh . fs . comg . c) pre    # now that we are completely familiar with pre,
                                             # we can immediately use it to define core, great.
                                             # Now core is a composition of a bunch of functions
                                             # applied to pre. Again, pre is the summaries,
                                             # so first we apply c, then comg, then fs, then comh, then d.

    Let me put words to this cluster. We have all the summaries for each of the divided subproblems; that's pre. We use c to combine them into some kind of agglomeration, like a pair of related results as in Divacon, or perhaps just a |P|-length vector of intermediate results. Then we apply comg, the pre-fs communication function, which associates with each item in the agglomeration some information taken from some other item in the agglomeration. To sort into increasing order, for example, comg might communicate the contents of each item between pairs in the agglomeration, so that the fs function can then choose the smaller for the left-hand item and the larger for the right-hand item. In any case, comg does communication BEFORE fs applies.

    Okay, then fs applies; this is the normal sequential function that also operated within each of the cores on their subsets of the total input data vector, but now it is operating on the summaries. Instead of scan [1 1 1] yielding [1 2 3], fs now carries out, in our example, scan [3 3 3] (those being the summaries), yielding [3 6 9].

    Then there is a post-fs communication, comh, which can apply; in this case it is a shift-right, where every position takes, in lockstep, the value from its left neighbor, or zero if there is none.

    Finally there is a division operation again, so that these adjusted elements are not stuck inside a vector but can be applied to the different processors and their data separately. Whew.

    So in short core is not the summaries (that's pre, like [3 3 3]), but the reprocessed summaries like [0 3 6]. Great.
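
    A hedged sketch of how the global phase could come out for scan, with my own guesses for each of the parameter functions (none of these definitions are from the paper; they are just one way to make the types and the example line up; I write dG for the division used here since my earlier sketch of d took an explicit processor count):

      c    :: [[Int]] -> [Int]
      c    = concat                 -- combine the per-core packets into one vector

      comg :: [Int] -> [Int]
      comg = id                     -- no pre-fs communication is needed for scan

      fs   :: [Int] -> [Int]
      fs   = scanl1 (+)             -- the same sequential running sum as in the local phase

      comh :: [Int] -> [Int]
      comh xs = 0 : init xs         -- post-fs communication: shift right, zero-fill (assumes a non-empty vector)

      dG   :: [Int] -> [[Int]]
      dG   = map (:[])              -- divide again: one offset packet per processor

      -- core = (dG . comh . fs . comg . c) [[3],[3],[3]]
      --      = [[0],[3],[6]]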

  • We are not out of the trees, but maybe there is light.

    post = map (fs . xp) (zip core seg)  
    in c post
    

    post is defined as applying fs.xp separately to each of the results of zipping core and seg together.

    My brain is about to melt. Zip is nowhere defined in this paper, but used 11 times.

    On the other hand zip is defined in George's thesis, p61-62, not without drama. Supposing you define a data structure of M, say 3, fields, and you have M=3 arrays of length N, then zip can construct a single array also of length N, but now each element of the zipped array is one of your structures containing M=3 fields taken from the corresponding array elements. That is what makes sense to me, and that is what he says. But then he throws the cards in the air:

    For example with B as above ("an array of the same shape"), we have

    B i = struct (A0 i, ... Am-1 i);

    i is an array of indices, hence A* i are arrays of A's, hence struct (A0 i, ... Am-1 i) is a struct of arrays not an array of structs. Hence my mind has indeed melted down entirely. Will someone explain it to me?

    Tell you what, I'm just going to pretend I never read that, and go with my intuition.
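
    For what it's worth, here is how the ordinary Haskell zip behaves, which matches the array-of-structs intuition (this is standard Haskell, not a quotation from the paper or the thesis):

      -- The Prelude's zip, written out (renamed zip' here only to avoid the name clash):
      zip' :: [a] -> [b] -> [(a, b)]
      zip' (x:xs) (y:ys) = (x, y) : zip' xs ys
      zip' _      _      = []

      -- zip [0,3,6] [[1,1,1],[1,1,1],[1,1,1]]
      --   ==> [(0,[1,1,1]), (3,[1,1,1]), (6,[1,1,1])]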

    So let's go back to the definition where it says (zip core seg). Now core is a collection of reprocessed summaries, as defined and discussed previously, and seg is the original collection of segments that were fed separately to each of the processors in the "pre = map ... seg" step.

    So then using my intuition for zip, (zip core seg) is now a collection of 2-element structs, one from core and one from seg. Great, we have a reprocessed summary and we have a whole pile of original data that this processor has seen before.

    There is more to do, but I'm getting zombified. We have to check the type of the xp function (a function taking a set of d's and a set of a's and returning a set of a's); is that meaningful to you? Not to me, although our inputs are indeed pairs, one from core (are those the d's?) and one from seg (those are surely the (S a)'s). Expand, or xp, occurs many times in this paper, but it is never clarified or exemplified unambiguously.

    "Expansion phase: c . map (fs . xp) . zip" is different from "expanded by function xp". I *think* xp takes the struct with the reprocessed summary and the data, prepends one to the other, and pushes the 1-longer data array out to each of the cores. That would yield "6+[1 1 1] in the 3rd core in our small scan example, which by a small modification of fs applied to this would yield [7 8 9], the desired outcome. Or perhaps xp (6 , [1 1 1]) obviates modifying fs by producing [7 1 1], which scans to [7 8 9] without change. I'll take that as evident, but the transfer of this data to the separate cores, and the location in which the xp function is carried out, is not specified by all this math, we engineers have to decide how to implement what the abstract layer demands. Me, I'd keep the seg arrays in the cores for later access, have xp both copy the reprocessed results to each core AND add it to the first element of that core's seg array, and then it works. But that's just me.

  • Breaking into the light: we are done if we can interpret "in c post".

    in c post
    

    in means that the expression "c post" is evaluated using the definitions made above (the let bindings), and its value is the value of the whole let ... in expression, i.e., of cc applied to its arguments.

    c is defined as the combination function, presumably one that concatenates results from all the cores in order to make a final output.

    post is defined above as map (fs . xp) (zip core seg), so it is the set of outputs of separately applying fs . xp to the pairs of reprocessed results from core and data segments from seg. But then core needed to be defined as well, which it is, in the global-phase discussion; of course that includes the initial division, the sequential processing within each processor, and the summarizing with the co function. The whole hierarchy of prerequisites unrolls as c and post are expanded into their definitions, and their definitions further expanded into theirs, all the layers, until we get this whole thing applied to the input collection of data s.

    Did I get it right? It works for scan the way I think it should.

    Yes.
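
    To convince myself, here is the whole skeleton written out as runnable Haskell and instantiated for the scan example. The cc skeleton follows the let/in structure as transcribed above, but every concrete parameter function below (the chunking division, concat for c, taking the last partial sum for co, the shift-right for comh, and so on) is my own guess, chosen only so that the types line up and the little example comes out right:

      cc d c co xp comg comh fs s =
        let seg  = d s
            pre  = map (co . fs) seg
            core = (d . comh . fs . comg . c) pre
            post = map (fs . xp) (zip core seg)
        in  c post

      -- Hypothetical instantiation for scan over three "processors".
      scanCC :: [Int] -> [Int]
      scanCC = cc d c co xp comg comh fs
        where
          p = 3
          d xs = chunk (length xs `div` p) xs   -- divide into p segments (assumes p divides evenly)
            where chunk _ [] = []
                  chunk n ys = take n ys : chunk n (drop n ys)
          c    = concat                         -- combine segments into one vector
          co   = (: []) . last                  -- compress: keep only the final partial sum
          comg = id                             -- no pre-fs communication needed for scan
          comh xs = 0 : init xs                 -- post-fs communication: shift right, zero-fill
          fs   = scanl1 (+)                     -- the sequential running sum
          xp (off, ys) = zipWith (+) (off ++ repeat 0) ys
                                                -- add this core's offset to its first element

      -- scanCC [1,1,1,1,1,1,1,1,1]  ==>  [1,2,3,4,5,6,7,8,9]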

Feedback is welcome.
Copyright © 2026 Thomas C. Veatch. All rights reserved.
Created: January 31, 2026