Further work

First, I got excited by this decomposition and wondered: what if the results are tree-structured, not just the computation graph? The result is TreeFT. Then I had a chat with George, who said I don't understand pre- vs. post-adjustment or their relationship to communication, and that I should go do some basic homework. So I have that to do next. The world of communication architectures in parallel execution environments opens up. I've long known about the world of TCP/IP sockets, but had only barely heard of the Message Passing Interface. George's distinguished thesis committee member, Bill Gropp, has actually spent his whole career shepherding the MPI spec along to support parallel programming. Everyone in the Acknowledgements of George's thesis seems to have been a major figure in that Golden Age of Parallel Algorithms at Yale.

So yes, it would be nice to write compilers that know exactly how to assign processes to processors, and when to communicate what, so we mortal coders don't have to think about it. But there's a lot of information there to absorb and generalize over: all the layers of cache memory, the speeds and diversity of wiring and network connections, and the availability and capabilities of processors, all of which might be modelled and used to optimally schedule and shepherd the distributed activities of a parallel algorithm so it runs fast.

I think that's beyond my homework project, though; I'm happy to have achieved O(log(N)) in some idealized parallel-processor time, disregarding communication.

But today, George suggests I write a Divacon environment in Python. So here's a 0.0001 draft, throwing some things on the page:


// CLI interpreter loop.
PDCinterpreter() {
  getenv(nProcs);     // number of (simulated) processors
  getenv(procLayout); // processor topology
  getenv(memsize);    // per-processor memory limit
  getenv(commCost);   // communication cost model
  while (1) {
    print "pdc> ";
    read s;
    sscanf(s, "%s %vec", cmd, src);
    executable = compile(cmd); // convert :, !, #, (..) into corresponding code units; build/return pdc = PDC(..)
    report = run executable src nProcs procLayout memsize commCost; // run, keeping track of time and comm costs
    print report;
  }
}

// linkable library function
function pointer PDC(div, comb, pre, post, leaf?, atleaf) {
  this = new struct PDCObj(div=div; comb=comb; pre=pre; post=post; leaf?=leaf?; atleaf=atleaf);

  confirm function div takes a vec arg and divides it into a tuple of vecs;
  etc.

  method run(input) {
    if (this.leaf?(input)) return this.atleaf(input);
    else {
      split = this.div(input);
      // pre-adjust, map the recursion over the tuple (!), post-adjust, combine
      return this.comb : this.post : ! this.run : this.pre : split;
    }
  }
}
// or something.
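
To make the draft concrete, here is a minimal runnable Python sketch of the PDC combinator, sequential and ignoring the cost model for now. The names and composition order follow my pseudocode above, not any official Divacon API; mergesort is just a placeholder instance to show the shape.

from typing import Callable

def PDC(div: Callable, comb: Callable, pre: Callable, post: Callable,
        leaf: Callable, atleaf: Callable) -> Callable:
    """Divide-and-conquer combinator: comb : post : !recurse : pre : div.
    div    vec -> tuple of vecs          comb   tuple of vecs -> vec
    pre    tuple -> tuple (pre-adjust)   post   tuple -> tuple (post-adjust)
    leaf   vec -> bool (base case?)      atleaf vec -> vec (base case)"""
    def pdc(v):
        if leaf(v):
            return atleaf(v)
        parts = pre(div(v))                  # divide, then pre-adjust
        parts = [pdc(p) for p in parts]      # ! = map the recursion over the tuple
        return comb(post(parts))             # post-adjust, then combine
    return pdc

# Placeholder instance: mergesort, with identity pre/post adjustments.
def halve(v):
    m = len(v) // 2
    return (v[:m], v[m:])

def merge(parts):
    a, b = parts
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]: out.append(a[i]); i += 1
        else:            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

ident = lambda parts: parts
msort = PDC(halve, merge, ident, ident,
            leaf=lambda v: len(v) <= 1, atleaf=list)
print(msort([5, 3, 8, 1, 9, 2]))  # [1, 2, 3, 5, 8, 9]

The pre and post slots are where Divacon's communication adjustments (and, for the FFT, the twiddle factors) would go; the interpreter loop above would wrap pdc to meter its time and communication against nProcs, memsize, and commCost.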
Further ideas might come from the following. It still seems we are missing some inner-factor correspondence and sharing: we ought to rotate/scale/add every sample not in a single \(b\)-bend alignment but at the \(b\) multiples for that sample's different steps, and those surely collapse into \(N/b\) overlappings. At the least, overlapping samples could be added before scaling the unit vector, saving a multiply per sample at each bend \(b > 1\) after the first full wrap.
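As a sanity check on the multiply-saving claim, here is a tiny sketch in my own notation (not the full \(N/b\) overlapping scheme): for a single output coefficient \(m\), every sample whose index product \(km \bmod N\) lands at the same angle shares the same unit vector, so those samples can be added before the one complex multiply.

import numpy as np

def dft_coeff_grouped(x, m):
    """X_m = sum_k x_k * w**(k*m), w = exp(-2j*pi/N), but samples sharing
    the angle (k*m) mod N are summed BEFORE multiplying, so the multiply
    count drops from N to the N/gcd(m, N) distinct angles."""
    N = len(x)
    w = np.exp(-2j * np.pi / N)
    buckets = {}                          # angle index -> running sum of samples
    for k in range(N):
        a = (k * m) % N                   # all samples at angle a share w**a
        buckets[a] = buckets.get(a, 0.0) + x[k]
    return sum(s * w**a for a, s in buckets.items())  # one multiply per angle

x = np.random.rand(8)
assert np.isclose(dft_coeff_grouped(x, 4), np.fft.fft(x)[4])  # 2 multiplies, not 8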

And in the world of communications, it seems there ought to be a blocking structure for the FFT coefficient matrix and sample vector that increases computational intensity, similar to George's by-blocks matrix-multiply algorithms 4.9, 4.10, and 4.11. Given some per-processor memory limit, there should be a way to calculate the computational-intensity-maximizing sub-blocking scheme, while also optimizing the tradeoff in the number of processors. (At some point, enlisting extra processors may cost more than the benefit thereof; if so, calculating the optimum tradeoff is desirable, then achieving it.)
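For the blocking intuition, a back-of-envelope sketch (standard blocked matrix-multiply accounting, not taken from George's thesis; memsize_words is a hypothetical parameter): a \(b \times b\) block multiply-accumulate performs \(2b^3\) flops while touching three blocks, i.e. \(3b^2\) words, so intensity grows like \(2b/3\) flops per word, and the best \(b\) is the largest whose working set fits in memory.

import math

def best_block_size(memsize_words):
    """Largest square block size b with 3*b*b words (blocks of A, B, C)
    fitting in fast memory; intensity 2b/3 grows with b, so bigger is better."""
    return math.isqrt(memsize_words // 3)

def intensity(b):
    """Flops per word moved for a b x b block multiply-accumulate."""
    return 2 * b**3 / (3 * b**2)  # = 2b/3

b = best_block_size(32 * 1024)    # say, 32K words of fast memory per processor
print(b, intensity(b))            # b = 104, ~69 flops per word moved

The processor-count tradeoff would then fall out of weighing this per-processor intensity against the commCost model above, which is exactly the homework I still owe.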

So, questions remain, but getting this far is a step forward for me.