A little encouragement

I had a whack at reading George's CC paper (with Paul Hudak and Hai Liu, 2010) a year or two ago and struggled. For some reason I was inspired to try again, taking notes on what was confusing, and indeed I feel like I'm on top of it now -- only because I sat with paper and pencil, wrote down equations, multiplied out matrices, and made sure. To a point, anyway. So let me take the opportunity to encourage you. On my first try I didn't do the homework, and after page 3 or 4 I found myself really blocked and couldn't even read it. And I must not be alone; George points out (once again!) that the field has not understood his work, including here in the context of the compress-and-conquer paper. George also helped me with an intellectual reset. My bottom-up reading style hinders me with his work. Instead, to understand and use Compress-and-Conquer, start by considering a simple example like scan, see how it works, and draw some examples out. Then, as you read, think about what you are reading in a top-down way, retaining your simple understanding -- which is to say your powerful insight -- and adjust your interpretation of the paper's notation to fit your understanding, rather than vice versa. This really helped, so I share it with you. Now, before continuing, let me give you a bit of the story.
A little history

This paper reports work funded by a Microsoft Research grant and came out in 2010. Around that time Bill Gates had noticed that multi-core CPUs were becoming the dominant hardware architecture in computing, and that there was a painful lack of intelligence regarding how to make use of these extra computing resources. It's like: use Unix multi-processing, one process per core. Or, if you have pthreads to do different things simultaneously in your program, you can try to hack up some joint activity to keep them busy at the same time -- mutexes, signals, that type of thing. Very ground level. But a big task written as a single program -- now that wasn't getting sped up by having multiple cores, unless someone spent a lot of time working on a custom version with pthreads or something. Big jobs are basically unaccelerated by multi-core computing. Disappointing. Did people then, do people now, understand how to keep everything busy when there is an actual big computing task whose algorithm itself might need to be split up over different cores? So Microsoft Research let out a grant competition to universities, not to commercial labs; maybe 300 universities were eligible, and perhaps 80-some submitted proposals. Among them all, the #1 top-rated proposal was this proposal, this work. So that's the context. Oh, you should also know that no one has taken notice of this paper; as an example, Google can find an academic reference to it, but AI in 2026 has no idea about it. George was doing a startup in Beijing or something, Paul Hudak and Hai Liu were at Yale, and they wrote this paper and went to Spain to present it. Afterwards, on the train, Paul and George spoke about death and Buddhism, Paul very curious, as George was writing Chinese translations of Buddhist texts, I believe, which are now widely copied and in circulation in China. Paul must have known at that time about his multiple myeloma, and he died within a year or two.
He was such a good man, and he died too young. So this is a great and meaningful paper, with lots of theorems and assertions of optimality. We like those. If you write code that runs on multi-core systems, you ought to have a look at this and see how a particular approach could dramatically speed up your algorithms. Okay?
Introduction

First, what is compress-and-conquer, and what is it good for? CC is an algorithm schema for splitting up a certain type of problem, a CC problem, for solution on a multi-core computer. With a limited number (|P|) of processors operating in parallel, one hopes that a problem of size N can be solved in O(N/|P|) time. If you have heavy computation and 10 cores, wouldn't you like it to take 1/10 the time? Sure. I have a note claiming that by doing MapReduce or Hadoop with CC, speedups of 1000x are "easily obtained". I would call CC a 'paradigm', my word for a higher-order function, and a cousin of the Divacon paradigm: CC is itself a function, which takes 7 functions as arguments and returns a function that uses them according to a particular logic. That resulting function is applied to a collection or sequence of data, and gives the answer in 1/|P| of the time using |P| cores as compared with a single processor. Now, we don't like decoration and fancy nothings; they form job security for the insecure, and confusion for the student and scientist. Therefore we write cc, its arguments, and its input all without parentheses and commas:
\(ccFunc = cc\ d\ c\ co\ xp\ precom\ postcom\ f_s\)

The logic is:
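To make the higher-order shape concrete -- seven function arguments in, one function out, and that result applied to a sequence -- here is a minimal Python sketch. Everything in it is a placeholder of my own: the stage bodies and the simple pipe-them-in-sequence plumbing are illustration only, not the paper's actual CC logic (and Python forces the parentheses and commas the paper's notation omits).

```python
# Hypothetical sketch only: the *shape* of cc as a higher-order
# function. The real CC logic in the paper is richer; here each of the
# seven arguments is just applied to the data in turn.

def cc(d, c, co, xp, precom, postcom, f_s):
    """Return a function that runs the seven stage functions in sequence."""
    def cc_func(xs):
        for stage in (d, c, co, xp, precom, postcom, f_s):
            xs = stage(xs)
        return xs
    return cc_func

# Six stages left as identity and a doubling stage for f_s, purely to
# show the plumbing: cc takes functions and yields a function.
identity = lambda xs: xs
cc_func = cc(identity, identity, identity, identity,
             identity, identity, lambda xs: [2 * x for x in xs])
print(cc_func([1, 2, 3, 4]))  # -> [2, 4, 6, 8]
```

The point is only the arity and the currying-like structure: build `cc_func` once from its seven pieces, then apply it to data.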
So if you read it and get stuck on the syntax or definitions or something in the land of picayune detail, just return to the big picture, think about what ought to be going on in that bit of the discussion, and as you make your own ideas concrete you will click into understanding what the text is saying. I think it's basically intellectual maturity, combined with self-respect (you have to believe that you can do it, and you can!), combined with determination to really extract the value from this. Please don't be demoralized, because George himself says: Don't shrink from your homework. Write things down by hand and work them out. You might need many hours of working out problems for an hour of reading, and that would be okay. Okay? Okay. Now here is me struggling along: