So yes, it would be nice for compilers to know exactly how to assign processes to processors, and when to communicate what, so that we mortal coders don't have to think about it. But there is a lot of information there to absorb and generalize over: all the layers of cache memory, the varying speeds and diversity of wiring, the network connections, and the availability and capabilities of processors, all of which might be modelled and used to optimally schedule and shepherd the distributed activities of a parallel algorithm so it runs fast.
I think that's beyond my homework project; I'm happy to have achieved O(log N) time on an idealized parallel machine, disregarding communication costs.
But now George suggests I write a Divacon environment in Python. So here is a 0.0001 draft, throwing some things on the page:
// CLI interpreter loop.
PDCinterpreter() {
    getenv(nProcs); getenv(procLayout); getenv(memsize); getenv(commCost);
    while (1) {
        print "pdc> ";
        read s;
        sscanf(s, "%s %vec", cmd, src);
        executable = compile(cmd);  // convert :, !, #, (..) into code units; build and return pdc = PDC(..)
        report = run(executable, src, nProcs, procLayout, memsize, commCost);  // run, keeping track of time and comm costs
        print report;
    }
}

// linkable library function
function pointer PDC(div, comb, pre, post, leaf?, atleaf) {
    this = new struct PDCObj(d=div; c=comb; pre=pre; post=post; leaf=leaf?; bounce=atleaf);
    confirm that d takes a vec argument and divides it into a tuple of vecs; etc.
    method run(input) {
        if (this.leaf(input)) return this.bounce(input);
        else {
            split = this.d(input);
            return this.c : this.post : ! this.run : this.pre : split;  // combine . post . map(run) . pre, applied to the split
        }
    }
}
// or something.
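And since the suggestion was Python, here is a minimal sketch of what the PDC() combinator itself might look like there. This is my own rough rendering, under assumptions: pdc, halve, ident, and the other names are placeholders of mine, not fixed Divacon syntax, and the sequential list comprehension stands in for what would really be a parallel map over processors.

    # A rough Python rendering of the PDC combinator above.
    def pdc(div, comb, pre, post, is_leaf, at_leaf):
        """Build f = comb . post . map(f) . pre . div, bottoming out at at_leaf."""
        def run(x):
            if is_leaf(x):
                return at_leaf(x)
            parts = pre(div(x))                  # divide, then pre-adjust (communication step)
            results = [run(p) for p in parts]    # map(run) -- sequential here, parallel in spirit
            return comb(post(results))           # post-adjust, then combine
        return run

    # Example: vector sum by repeated halving.
    halve = lambda v: (v[:len(v) // 2], v[len(v) // 2:])
    ident = lambda t: t
    add = lambda t: t[0] + t[1]

    total = pdc(halve, add, ident, ident, lambda v: len(v) == 1, lambda v: v[0])
    print(total([3, 1, 4, 1, 5, 9, 2, 6]))      # prints 31

One nice property: pdc() returns a closure, so the interpreter's instrumentation (timing, comm-cost accounting, the report) could later be wrapped around run() without touching the combinator itself.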
Further ideas might come from the world of communications: it seems there ought to be a blocking structure for the FFT coefficient matrix and sample vector that increases computational intensity, similar to George's by-blocks matrix-multiply algorithms 4.9, 4.10, and 4.11. Given some per-processor memory limit, there should be a way to calculate the sub-blocking scheme that maximizes computational intensity, while also optimizing the tradeoff in the number of processors. (At some point, enlisting extra processors may cost more than it gains; if so, calculating the optimum tradeoff is desirable, and then achieving it.)
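To make that concrete with the standard cache-blocking arithmetic (my numbers and names, not George's): a b-by-b block update does 2*b^3 flops while holding three b-by-b blocks (A, B, C) in fast memory, so intensity grows like 2b/3, and a memory limit of M words caps b at roughly sqrt(M/3). A sketch:

    import math

    # Back-of-the-envelope block sizing for a blocked matrix multiply,
    # assuming the usual model: a b-by-b block update does 2*b**3 flops
    # while holding three b-by-b blocks in a fast memory of M words.

    def best_block_size(M):
        """Largest b such that three b*b blocks fit in M words."""
        return math.isqrt(M // 3)

    def intensity(b):
        """Flops per word moved: 2*b**3 / (3*b**2) = 2*b/3."""
        return (2 * b**3) / (3 * b**2)

    M = 64 * 1024                     # e.g. 64K words of per-processor memory
    b = best_block_size(M)
    print(b, round(intensity(b), 1))  # b = 147, intensity ~98 flops/word

Whether the FFT admits the same analysis is exactly the open question: its butterfly structure reuses data differently from matrix-multiply blocks, so the cost model would have to be reworked.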
So, questions remain, but coming this far is a step forward, for me.