So yes, it would be nice for compilers to know exactly how to assign processes to processors, and when to communicate what, so that we mortal coders don't have to think about it. But there is a lot of information there to absorb and generalize over: all the layers of cache memory, the varying speeds and diversity of wiring, the network connections, and the availability and capabilities of processors, all of which might be modelled and used to optimally schedule and shepherd the distributed activities of a parallel algorithm so it runs fast.
I think that's beyond my homework project; I'm happy to have achieved O(log N) time on an idealized parallel machine, disregarding communication costs.
But now George suggests I write a Divacon environment in Python. So here is a 0.0001 draft, throwing some things on the page:
// CLI interpreter loop.
PDCinterpreter() {
    getenv(nProcs); getenv(procLayout); getenv(memsize); getenv(commCost);
    while (1) {
        print "pdc> ";
        read s;
        sscanf(s, "%s %vec", cmd, src);
        executable = compile(cmd);  // convert :, !, #, (..) into code units; build and return pdc = PDC(..)
        report = run(executable, src, nProcs, procLayout, memsize, commCost);  // run, keeping track of time and comm costs
        print report;
    }
}

// linkable library function
function pointer PDC(div, comb, pre, post, leaf?, atleaf) {
    this = new struct PDCObj(d=div; c=comb; pre=pre; post=post; leaf=leaf?; bounce=atleaf);
    confirm that d takes a vec argument and divides it into a tuple of vecs; etc.
    method run(input) {
        if (this.leaf(input)) return this.bounce(input);
        else {
            split = this.d(input);
            return this.c : this.post : ! this.run : this.pre : split;  // combine . post . map(run) . pre, applied to the split
        }
    }
}
// or something.
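And since the suggestion was Python, here is a minimal sketch of what the PDC() combinator itself might look like there. This is my own rough rendering, under assumptions: pdc, halve, ident, and the other names are placeholders of mine, not fixed Divacon syntax, and the sequential list comprehension stands in for what would really be a parallel map over processors.

    # A rough Python rendering of the PDC combinator above.
    def pdc(div, comb, pre, post, is_leaf, at_leaf):
        """Build f = comb . post . map(f) . pre . div, bottoming out at at_leaf."""
        def run(x):
            if is_leaf(x):
                return at_leaf(x)
            parts = pre(div(x))                  # divide, then pre-adjust (communication step)
            results = [run(p) for p in parts]    # map(run) -- sequential here, parallel in spirit
            return comb(post(results))           # post-adjust, then combine
        return run

    # Example: vector sum by repeated halving.
    halve = lambda v: (v[:len(v) // 2], v[len(v) // 2:])
    ident = lambda t: t
    add = lambda t: t[0] + t[1]

    total = pdc(halve, add, ident, ident, lambda v: len(v) == 1, lambda v: v[0])
    print(total([3, 1, 4, 1, 5, 9, 2, 6]))      # prints 31

One nice property: pdc() returns a closure, so the interpreter's instrumentation (timing, comm-cost accounting, the report) could later be wrapped around run() without touching the combinator itself.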
Further ideas might come from the world of communications: it seems there ought to be a blocking structure for the FFT coefficient matrix and sample vector that increases computational intensity, similar to George's by-blocks matrix-multiply algorithms 4.9, 4.10, and 4.11. Given some per-processor memory limit, there should be a way to calculate the sub-blocking scheme that maximizes computational intensity, while also optimizing the tradeoff in the number of processors. (At some point, enlisting extra processors may cost more than it gains; if so, calculating the optimum tradeoff is desirable, and then achieving it.)
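To make that concrete with the standard cache-blocking arithmetic (my numbers and names, not George's): a b-by-b block update does 2*b^3 flops while holding three b-by-b blocks (A, B, C) in fast memory, so intensity grows like 2b/3, and a memory limit of M words caps b at roughly sqrt(M/3). A sketch:

    import math

    # Back-of-the-envelope block sizing for a blocked matrix multiply,
    # assuming the usual model: a b-by-b block update does 2*b**3 flops
    # while holding three b-by-b blocks in a fast memory of M words.

    def best_block_size(M):
        """Largest b such that three b*b blocks fit in M words."""
        return math.isqrt(M // 3)

    def intensity(b):
        """Flops per word moved: 2*b**3 / (3*b**2) = 2*b/3."""
        return (2 * b**3) / (3 * b**2)

    M = 64 * 1024                     # e.g. 64K words of per-processor memory
    b = best_block_size(M)
    print(b, round(intensity(b), 1))  # b = 147, intensity ~98 flops/word

Whether the FFT admits the same analysis is exactly the open question: its butterfly structure reuses data differently from matrix-multiply blocks, so the cost model would have to be reworked.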
So, questions remain, but coming this far is a step forward, for me.