Compress-and-Conquer

Study Notes

On Mou
on Divide and Conquer

A little encouragement

I took a whack at reading George's CC paper (written with Paul Hudak and Hai Liu, 2010) a year or two ago and struggled.

For some reason I am inspired to try again, taking notes on what is confusing, and indeed I feel like I'm on top of it now -- but only because I sat with paper and pencil, wrote down equations, multiplied out matrices, and made sure. Up to a point, anyway. So let me take the opportunity to encourage you. On my first try, after page 3 or 4, I hadn't done the homework, and I found myself really blocked and couldn't even read it. And I must not be alone; George points out (once again!) that the field has not understood his work, including here in the context of the compress-and-conquer paper.

George also helped me with an intellectual reset. My bottom-up reading style hinders me with his work. Instead, to understand and use Compress-and-Conquer, start by considering a simple example like scan, see how it works, and draw some examples out. Then, as you read, think about what you are reading in a top-down way, retaining your simple understanding -- which is to say, your powerful insight -- and adjust your interpretation of the paper's notation and statements to fit your understanding, rather than vice versa.

This really helped. So I share it with you.

Now before continuing let me give you a bit of the story.

A little history

This paper reports work funded by a Microsoft Research grant and came out in 2010. Around that time Bill Gates had noticed that multi-core CPUs were becoming the dominant hardware architecture in computing, and that there was a painful lack of intelligence about how to make use of these extra computing resources. It's like: use Unix multi-processing, one process per core. Or, if you have pthreads to simultaneously do different things in your program, you can try to hack up some joint activity to keep them busy at the same time -- mutexes, signals, this type of thing. Very ground level.

But a big task written as a single program, now that wasn't getting sped up by having multiple cores, unless someone spent a lot of time working on a custom version with pthreads or something. Big jobs are basically unaccelerated by multi-core computing. Disappointing.

Did people then, do people now, understand how to keep everything busy when there is an actual big computing task where the algorithm itself might need to be split up over different cores?

So Microsoft Research let out a grant competition to universities, not to commercial labs; maybe 300 universities were eligible, and perhaps 80-some submitted proposals. Among them all, the #1 top-rated proposal was this proposal, this work.

So that's the context. Oh, you should also know that no one has taken notice of this paper; as an example, Google can find an academic reference to it, but AI in 2026 has no idea about it. George was doing a startup in Beijing or something, Paul Hudak and Hai Liu were at Yale, and they wrote this paper and went to Spain to present it.

Afterwards, on the train, Paul and George spoke about death and Buddhism, Paul very curious, as George was writing Chinese translations of Buddhist texts, I believe, which are now widely copied and in circulation in China. Paul must have known at that time about his multiple myeloma, and he died of it in 2015. He was such a good man, and died too young.

So this is a great, and a meaningful paper, with lots of theorems and assertions of optimality. We like those. If you write code that runs on multi-core systems, you ought to have a look at this and see how a particular approach could dramatically speed up your algorithms. Okay?

Introduction

First, what is compress-and-conquer, and what is it good for?

CC is an algorithm schema for splitting up a certain type of problem, a CC problem, for solution on a multi-core computer. With a limited number |P| of processors operating in parallel, one hopes that a problem of size N can be solved in O(N/|P|) time. If you have heavy computation and 10 cores, wouldn't you like it to take 1/10 the time? Sure.

I have a note claiming that, doing MapReduce or Hadoop with CC, speedups of 1000x are "easily obtained".

I would call CC a 'paradigm', my word for a higher-order function, and a cousin of the Divacon paradigm: CC is itself a function, which takes 7 functions as arguments and returns a function that uses them according to a particular logic. That resulting function is applied to a collection or sequence of data, and gives the answer in 1/|P| of the time using |P| cores, as compared with a single processor.

Now, we don't like decoration and fancy nothings; they form job security for the insecure, and confusion for the student and scientist. Therefore we write cc, its arguments, and its input all without parentheses and commas:

\(ccFunc = cc\ d\ c\ co\ xp\ precom\ postcom\ f_s\)

\(ccFunc\ s\) applies this function to a sequence of data \(s\).
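Python doesn't apply functions by juxtaposition the way the paper's notation does, but the parenthesis-free style corresponds to currying: each function takes one argument and returns a function awaiting the next. Here is a toy illustration with a hypothetical compose combinator (my own example, not the paper's):

```python
# Curried style: "compose inc dbl x" becomes compose(inc)(dbl)(x).
def compose(f):
    return lambda g: lambda x: f(g(x))

inc = lambda x: x + 1
dbl = lambda x: 2 * x

h = compose(inc)(dbl)   # read: "compose inc dbl", a new function
print(h(10))            # inc(dbl(10)) = 21
```

The cc function works the same way, just with seven arguments instead of two, finally yielding \(ccFunc\), which waits for the data sequence.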

The logic is:

  • First \(seg = d\ s\). Read it as "segment equals divide applied to the input sequence". That is, split the data up into |P| segments, one on each core. (Save a copy.)

  • then run \(map\ f_s\ seg\), i.e., run \(f_s\) on each core on its own segment of the data.

  • then \(co\), "compress": extract a small amount of data that the segment will contribute to the global picture and needs to share with the others.

  • then send those "in-messages" to one processor, call it \(P_0\), for the global phase.

  • then collect them together there with \(c\).

  • then the different messages can have any kind of interdependence, and you can have precommunication and postcommunication combined with the basic function \(f_s\), or even a different function \(g_s\) if that will help (\(g_s\) = no-op could come in handy, for example). Run those one after the other on \(P_0\) on the collected results from the several cores. Now you have |P| outmessages, basically ready for distribution to the cores for the next round.

  • then \(d\): divide the outmessages into individual outmessages, and send each one to its respective processor.

  • then \(xp\), expand: absorb the outmessage received at each processor into its data segment. For example, the i'th segment needs to recalculate based on information from the (i-1)'th segment, to learn how it fits into the big picture. To do so it might prepend, add, or replace the received outmessage at the beginning of its segment, or do some dependency calculations to set up \(f_s\) to burn through and achieve final success on its inputs. This is called "expansion", not sure why; but when the function \(f_s\) is then applied, the work is really done. Now it's really solving the big problem, in the context of everything thought through together, not just its little part of it in isolation.

  • then \(c\) combine the final results from each processor into a single output sequence, and you're done.

    I think it's pretty clear.

  • So if you read it and get stuck on the syntax or definitions or something in the land of picayune detail, just return to the big picture, think about what ought to be going on in that bit of the discussion, and as you become concrete about your own ideas you will click into understanding of what the text is saying. I think it's basically intellectual maturity, combined with self-respect (you have to believe that you can do it, and you can!), combined with determination to really extract the value from this. And please don't be demoralized, because George himself says: don't shrink from your homework; write things down by hand and work them out; you might need many hours of working out problems for an hour of reading, and that would be okay.
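As a bit of homework of my own, here is a minimal sequential sketch in Python of the steps above, applied to scan (prefix sum), the simple example mentioned earlier. This is my own simplification, not the paper's code: I fold precommunication, postcommunication, and the global function into a single glob argument, and I simulate the |P| cores with a plain loop, so take the names and the signature as assumptions.

```python
from itertools import accumulate

def cc(d, c, co, xp, glob, f_s):
    """Sequential simulation of the compress-and-conquer schema.
    Simplification: precom, postcom, and the global function are
    folded into one argument, glob; the |P| cores become a loop."""
    def ccFunc(s, p):
        segs = d(s, p)                               # divide into p segments
        segs = [f_s(seg) for seg in segs]            # local phase: map f_s
        msgs = [co(seg) for seg in segs]             # compress; collect on P0
        outs = glob(msgs)                            # global phase on P0
        segs = [xp(seg, m) for seg, m in zip(segs, outs)]  # expand
        return c(segs)                               # combine final results
    return ccFunc

def d(s, p):
    """Divide s into p nearly equal contiguous segments."""
    k, r = divmod(len(s), p)
    segs, i = [], 0
    for j in range(p):
        n = k + (1 if j < r else 0)
        segs.append(s[i:i + n])
        i += n
    return segs

# Instantiation for scan (prefix sum):
scan = cc(
    d,
    c=lambda segs: [x for seg in segs for x in seg],  # flatten segments
    co=lambda seg: seg[-1],                           # segment total
    xp=lambda seg, off: [x + off for x in seg],       # shift by offset
    glob=lambda ts: [0] + list(accumulate(ts))[:-1],  # exclusive scan of totals
    f_s=lambda seg: list(accumulate(seg)),            # local prefix sum
)

print(scan(list(range(1, 9)), 3))   # [1, 3, 6, 10, 15, 21, 28, 36]
```

Notice the generality of the schema: swap out these few little functions and you get a different CC algorithm, while the communication pattern -- divide, local work, compress, global work, expand, combine -- stays exactly the same.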

    Okay? Okay. Now here is me struggling along:

    • cc2.php Page 2 confusions discussed.
    But meanwhile, see if you can make something of the CC paper yourself!

    Copyright © 2026 Thomas C. Veatch. All rights reserved.
    Created: January 31, 2026