Below are my unedited notes from conversations with Zhijing George Mou in 2024, at his apartment, at Thanh Brothers Vietnamese Pho restaurant, and elsewhere, sometimes over the phone.
Near the end I wrote down some...
5/27/24 phone call with George Mou
Regarding my question, which asked: what is communication in the GM system, and how does a function between index sets express it? On communication as indexes: any computation consists of two parts, the calculations and the communication that gets the data there for the calculations. Maybe you're on a parallel processor, maybe not; don't assume irrelevancies, just get the logic right. So suppose what you need for a calculation is all in a vector; then the data is found under indexes. To calculate a lot of sums, like V[i]+V[j], you need indexes. So indexing is a form of communication, but normally it is singular, moving one piece of data, not a group. Whereas consider logically parallel computing, say LR or EO, the half-and-half splittings, which can do everything in Numerical Recipes. E.g. FFT: it can be proven that odd-even splitting is equivalent to LR splitting (refer to the theorem).
For any i you can communicate with i+1 in a one-at-a-time approach. But instead, open your eyes and go with a group of numbers. In any (stupid, current) programming language only the adjacent next is available. Not so in Divacon.
I asked a question: does communication refer to moving data among levels of the caching hierarchy? I offered the thought, perhaps let the code handle it explicitly; then you would have communication, concretely. George says no, that would be dumb, let your compiler do it. On the other hand, no other system offers group communication, which under Divacon is offered and works.
For this reason in Divacon, the FFT algorithm is just two lines. The bitonic sort algorithm is 3 lines. Not 5-10 pages like others use. (I can confirm this because I have written FFT code and that’s about right.)
I asked: Is div LR group communication?
In chemistry there are only proton, electron, neutron: 3 things make the universe.
In Computer Science there are 3 orthogonal dimensions: 1) structure manipulation, like Div LR or Div EO, etc.; in structure manipulation there is no communication and no operation. 2) Communication, as in: what information is needed for one's work (getting it there by machine copy commands is low level; the right way to think of it is what information is needed, not how to move it there). Communication contains no structure manipulation and no calculations/operations. Finally 3) operations: these assume the data is right there; no communication is necessary or (to encourage properly factoring your work) even allowed. Hence each of them, SM, Comm, Ops, is orthogonal, independent of the others. And yet with the three you can have all of CS. Orthogonal dimensions are so important: then you avoid muddle, especially muddled thinking.
Communication is data movement: move data from one location to another. Correspondent communication moves data from the left side to the right side. Think of it like this: you cut a stick in half, then move the halves end to end to align; each i'th element is adjacent, aligned, to its counterpart on the other stick. This is called correspondence. This is all you need to achieve every algorithm in Numerical Recipes.
In stupid current modern programming languages what you look at is one item, but in this you can look at K items at once. In the von Neumann bottleneck you are stuck handling one thing at a time; hence we use loops instead of higher-order functions. Instead, in logic, move two, which breaks the bottleneck, even though in logic you don't impose time constraints; you are not saying when to do the moving. Don't add any constraint which is not logically necessary.
George asks on Zoom: What is Computer Science? What is the most important thing? Tom replies: Abstraction. Yes. And all current CS is counter-CS, contrary to CS.
After 70 years the community still hasn't figured it out. Looking at a maximum of one number is all they can do. To look at two numbers is considered too horrible. But if you can do one, then you can do 2, or any number K, by induction.
An allegory: in ancient times they had limited arithmetic and limited concepts of number. Two primitive people were challenged to a mental contest: what is the biggest number you can think of? One scratches his chin, thinks for a long time, and announces his answer is 1. The other thinks even longer, and finally says, I give up, you win. That is the whole field today.
The great miracle algorithm, the FFT, can use correspondent or odd-even either way; they are equivalent. One is premorphism-plus-odd-even vs postmorphism-plus-LR.
Each reduces to the other.
Tom says: in CS departments there is a tiny minority that does theoretical CS, proving stuff about code, and they think they are better than everyone else. After these conversations I think I am becoming one of them.
George says: Theoretical CS is mostly about computability, complexity and automata, not really programming language theory. Still that is the most respectable work in CS.
But Gates and Musk often say stupid things, and people have no idea.
Science should theoretically progress: even negative answers are progress in a world of unsolved problems. For any problem given infinite time there should be progress, but is there?
Tom offers the analogy with rational and real numbers, rational (say decimal, or binary) numbers are a finite sequence of choices from a finite vocabulary, whereas the reals also include unending unpredictable sequences. In the same way every proof or algorithm is a sequence of statements from a finite vocabulary: finitely long if a halting problem, never-endingly unpredictable if non-halting. Also in the same way that in practically computable applications we can use only rationals and have no need of real numbers (even floats are rational and finite), similarly to say anything interesting or useful we may not need any infinitely-unpredictable-proof based theorems. Hence I am not suffering because of Gödel.
George replies: in the halting-problem literature it is shown that no algorithm can finitely decide whether an arbitrary program will ever actually end. To see this from abstract algebra: given two domains, infinite but countable, the set of functions between them is uncountably infinite. The algorithms, being finite strings over an alphabet, are infinite but only countably so. So there are uncountably many functions (problems) but only countably many algorithms; most problems, the halting problem among them, have no algorithm.
George says that in the last lecture of his algorithms and computability class he says: we must know, in front of God, how tiny we are. Our tiny speck, the finitely solvable problems, is infinitesimal in the face of the uncountable infinity of problems. Therefore fear God. Musk and Gates are devils thinking they can solve everything. Tom says, I'm sorry, I must be a devil!
George offers his CS students a problem on day one. Write a program to write all current literature. There is an easy answer: you can easily write all 1-letter literature. Therefore you can also write all two letter novels. Etc. (Kleene Star.) By Induction.
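The day-one induction can be made concrete in a few lines of Python (my own sketch, not George's assignment code): all length-n strings are the length-(n−1) strings extended by each letter, and the Kleene star is the union over all n.

```python
def all_strings(alphabet, n):
    """All length-n strings over the alphabet, built by induction:
    extend every length-(n-1) string by each letter in turn."""
    if n == 0:
        return [""]
    return [s + c for s in all_strings(alphabet, n - 1) for c in alphabet]

all_strings("ab", 2)  # ['aa', 'ab', 'ba', 'bb']
```

Enumerating n = 0, 1, 2, ... gives the Kleene star of the alphabet, hence (eventually) all current literature.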
DC programs are all running, Tom, it’s not a joke. Divacon is implemented on 8 machines. Tom says, well I just can’t download it from GitHub and run stuff that I write. Not yet anyway.
George says, let’s talk about the Recursive Array.
What is a data structure? Data plus operation!
A vector is a group of data and an operation of indexing. With the monofocal-eye problem of modern computer scientists (the von Neumann bottleneck) they cannot look at more than one thing, and even then only from the end of an array. But a Recursive Array is an object that splits into (balanced) subsets. If you can look at K elements at a time then you get group communication, not just indexing. You also get relative indexing, as when, after splitting a (e.g. 16-element) vector in half, you have a+1..8 and b+1..8 in the split subgroups of the data. Correspondent communication between two subgroups is a generalization of an array with multifocal, multi-target eyeballs.
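The split-and-align picture can be sketched in a few lines of Python (the function names are mine, not Mou's actual API): the two balanced divisions of a Recursive Array, and correspondent communication pairing element i of one half with element i of the other.

```python
def div_lr(v):
    """Left-right division: cut the stick in the middle."""
    n = len(v) // 2
    return v[:n], v[n:]

def div_eo(v):
    """Even-odd division: evens in one half, odds in the other."""
    return v[0::2], v[1::2]

def comm_corr(left, right):
    """Correspondent communication: align the halves end to end, so
    each i'th element of one half is paired with the i'th of the other."""
    return list(zip(left, right))

v = list(range(8))
l, r = div_lr(v)         # ([0, 1, 2, 3], [4, 5, 6, 7])
pairs = comm_corr(l, r)  # [(0, 4), (1, 5), (2, 6), (3, 7)]
```

Note the relative indexing: inside each half, positions run 0..3 again; position 0 on the right *means* 4 on the parent.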
GM says it's time to write a book, although everything is in the thesis. Tom agrees, it needs to become a book. George: it needs not the details but the lessons: orthogonality, the story of the one-eyed primitives, the higher-level lessons, etc.
Tom says: yes, do it!
FFT =
Divide: Div_eo
Conquer (butterfly): {out i=0} = {in j=0} + e^(i·depth²) {in j=1},
                     {out i=1} = {in j=0} − e^(i·depth²) {in j=1},
M.
Given A’ A = I show A A’ = I:
Then A’ A A’ = I A’ = A’ post multiplying by A’
and A A’ A A’ = A I A’ = A A’ pre multiply by A.
Also A A’ A = A I = A, and A’ A A’ A = I I = I.
So A A’ A = A = I A = A’ A A. No help.
Discover the O(log n) PDC for FFT
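A minimal executable sketch of that homework, in plain Python rather than Divacon notation (the code and its names are mine): EO division, correspondent (butterfly) communication, and a multiply-add local op give the whole radix-2 FFT in a few lines, with O(log n) parallel depth since both halves can recurse at once.

```python
import cmath

def fft(x):
    """Radix-2 decimation-in-time FFT as a PDC sketch:
    divide = even-odd split, communication = correspondent (butterfly),
    local op = twiddle multiply-add. Assumes len(x) is a power of two."""
    n = len(x)
    if n == 1:
        return x
    even, odd = fft(x[0::2]), fft(x[1::2])   # div_eo, then recurse on both halves
    out = [0j] * n
    for k in range(n // 2):                  # corr comm + local butterfly
        w = cmath.exp(-2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out
```

Run on [1, 2, 3, 4] this gives [10, -2+2j, -2, -2-2j], matching the naive DFT.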
———————
Phone call with George Mou June 16, 2024.
Kidneys struggling with myeloma, reduced chemo now but not improving, and CAR-T to start in August.
GM went from Yale to Brandeis to WaveTrace
Then he went to give a talk at IBM, where his first slide said,
“Why the IBM parallel computer is completely Wrong”.
He gave the talk, then asked: any questions? None. Then they asked, is there a paper? Here's the paper. They said, we will show you why you are wrong. He waited for their feedback. Instead he got a job offer from IBM.
Later he went to the Johns Hopkins high-performance computing lab and group, where he was supported to build a computer and run his stuff, so he bought chips and built a parallel computer with 32 nodes, and ran many applications. He had many people working under him, so he would try to give them something to do. Explain, wait 2 weeks, they don't get it at all, they are completely confused. Explain again. Do you understand? "Yes." Two weeks later they still didn't understand. "After a number of iterations I gave up. I went to the lab and said, you watch me. After five days Divacon was running on the parallel computer."
Later he was invited to consult on the Connection Machine by Guy Steele, who later was a Turing Award decider. George was the first speaker to speak at the Connection Machine company.
But no matter who, no one really gets it.
Like at Yale: my adviser Perlis, after the talk, says, George, what you talked about looks beautiful, like a crystal palace. Only one problem: we don't understand it. Please wait in the hall for a while. They came out and said, we don't understand what you are talking about. Did they figure it out? They explained: the way you walk and the tone of your voice and your eyes are telling us, he must be right. We believe you. But we don't understand.
George: To come back to your question, Did you have a compiler team? No I never found anyone who could help me. Only exact copying. lol .
Look up George's paper on the communication-optimal FFT: a k-dimensional FFT on an m-dimensional mesh. In: PARLE '93, Parallel Architectures and Languages Europe: 5th International PARLE Conference, Munich, Germany, June 14-17, 1993, Proceedings. Berlin / New York: Springer-Verlag, ©1993.
What is a mesh? A 2D mesh = a grid: N x N is a grid of processors. A 3D mesh is a cube of processors. 4D is easy to describe, theoretically possible, and easy in the math, but no one has built one yet.
Optimal means: in correspondent communication to all other processors, the minimum imaginable is the travel distance, the diameter of the network. If you can show its communication time equals the diameter of the mesh, then that would be the fastest ever. So consider the problem of how to assign the nodes to the processors of the mesh so that the communication distance = the diameter of the mesh.
Odd even… it was mentioned in the thesis?
Two options: do your FFT in time (odd-even) or in frequency (left-right). From the Divacon DC view they are exactly the same: EO is equivalent to LR, as pre-odd-even = post-LR. But people still don't see it; the FFT algorithms are listed separately. (See NR.)
A premorphism under correspondent communication with odd-even division is equivalent to a postmorphism with the division being LR.
Yes DC uses morphism at its core. You can understand when you get the breakdown.
Bitonic sort is so confusing that its adjustment is itself a DC. I have my own algorithm, a monotonic sort, more elegant and cleaner. (Thesis 4.3.5, p. 89.)
In my monotonic sort, the first level is a postmorphism and the adjustment function in the second level is a premorphism. Alan Perlis asked: can you reverse it, making the outer one a premorphism and the nested one a postmorphism? "I still can't solve his question." With corr, correspondent communication, yes, right away; but with mirr I still have no answer. Some day I'd like to come back to this.
Remember Backus in his Turing award lecture emphasized algebraic transformation,
So GM showed how, e.g., #(last 1) transforms to # corr,
You may ask, why? Last-one communication is like this: suppose you have to call 8 people to go to a party; last 1 means one person calls all 8. But it's faster as a phone tree, which is done in 3 steps, not 7. That's one transformation shown in the thesis. Another is EO = LR, under the condition that the communication is correspondent. Many such algebraic transformations can be done quite easily.
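The phone-tree arithmetic can be sketched directly (my own toy function): each round, everyone who already knows calls one person who doesn't, so the informed set doubles, and n people need only ceil(log2 n) rounds.

```python
def broadcast_rounds(n):
    """Phone-tree broadcast: the informed set doubles every round,
    starting from one caller. Returns rounds needed to inform n people."""
    informed, rounds = 1, 0
    while informed < n:
        informed *= 2
        rounds += 1
    return rounds

broadcast_rounds(8)  # 3 rounds, versus 7 sequential calls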
Is it mere fun? No, it has much significance: by means of such a transformation, O(n) can become O(log n), for example.
Even Google people don’t have the concept of computational intensity.
CI = Local calculation count / Communications count.
Given a bit of data, how many times do you use it? If only once then it means you need to move it for every use. But it should be many times, not once, since communications are actually slow.
Looking at matrix multiplication: data is N^2 and computation is N^3, so the ratio is CI = N. (See https://sites.cs.ucsb.edu/~gilbert/cs140/notes/ComputationalIntensityOfMatMul.pdf)
So anything that has CI ~ O(1) is stupid. Like, for example, Google; like MapReduce. Consider the dot product: it's a great tool, but it has computational intensity O(1). You have to bring in each factor, then do one multiply-add; there is no reuse of the data. That is, a vector inner product, N dot N, is actually very slow, and yet the whole field is using it. Inner products are done all the time by everyone in AI, linear systems, etc.: so much waste of time, money, electricity. But who wants to hear that?
What's the trick, the better way, the improvement? Just don't use
inner products in the first place. Done the right way, NxN matrix
multiplication has CI of O(N), and it can be divided by DC: one way is
to do eight smaller multiplications (or two, or four), basically
reducing to O(m) with m the block size. DC in a different sense.
Once you understand that, throw it out and look at the
computation. What kind of communication, divide, local, combine? It
becomes so simple even a grade-school student can get it.
Build a DC using Python: implement the different PDC parts (div, comb,
comm), build it, and call it.
Then it will run.
I had 3 PhD students at Xiamen University a few years ago. I explained
things to one, in detail. Still he didn't get it: it was not running
on his own computer. A few hours later it was running, but even
watching by my side he couldn't do it. Why so hard, I don't know.
Local vs comm vs structure change.
DC is simply to identify the three.
Ma() as combination is wrong thinking.
Once divided that way it is very simple. People quickly confuse themselves.
—-
Tom thinking about a wavelet-ish Fourier transformation.
Instead of a Fourier equiwidth array of F_b's,
just don't overlap the b's: thus a
Fourier tree where every analyzed subsequence has only 360 degrees
to yield a result, i.e. no bend overlapping.
Let F_{b,N,X}[bits] = sum_{s=0..2^(N/2−b)} X[s] e^{i s T b / N}
At DC (b=0), the whole window sits on e^{i0}: F0.
At the lowest frequency, b=1, the whole window wraps on one bend: F1.
At b=2, half the window wraps for a bend, which comes out as a frequency result F10,
and the second half of the window wraps for one bend, which comes out as F11.
Recurse until a single sample pair wraps to one bend, which is the difference of the two points.
Last, a single sample at e^{i0} is the point itself.
Hence the Fourier tree includes, at the leaf level,
the data itself; preterminally, the pairwise differences (non-overlapping);
and, as you go toward the root of the tree, wider-windowed, lower-frequency
F values using more samples.
F(0|1)+ define the indexes; each has a complex value.
Only half the data informs each node at the next level: is that a problem?
If you want to represent fine detail finely you should sample faster,
for more data to estimate the higher frequencies.
Should the results at each level be subtracted from the next level?
Do we have orthogonality? If so then no. (Really?) Code it with a
yes/no flag to shift the points at each level to center on the F[bits]
complex frequency of the previous level, and see what is more
interesting or effective.
Can this be generalized to 3D for memory or scene representation, or
to 2D image or LIDAR map processing, to yield a 3D multi-scale bloblet
transform?
Mou thesis errata.
Item 4 p56 missing close square bracket.
Example 4.1 p78 last line, 28 is missing.
P87 last word identity not identify.
P88,89, algorithms 4.9,.10,.11 “mm = (“ should be “mm = PDC(“.
Algorithm 4.11 should have atom? before mm_b.
6/19 George Mou, phone call 1pm after meeting Simon.
Here, Tom, think about 1+1 = 2: do you or don't you throw out the
apple and do the math regardless of what the counted things are?
Tigers, trees, apples, or any other counted item. For FFT, all you
need to know is in the spirit of Divacon: no apple, tree, or tiger, but
communication, calculation, division (= structure manipulation).
Butterfly multiplication, two numbers…
In a good algorithm what is the division…
In the thesis are sets of PDCs: divide into two; within each, the
parts communicate; and the results go to a new result by the local
operation.
Division is structure manipulation. Cut one piece of rope. Is it by indexes? No.
Division does not do normalization.
When you cut a rope in the middle, it's a new structure. Yes, you
could say the first element of the second array was #4 of the
original, but equally valid: it has a relative index, so it's zero
on the right, which *means* 4 on the parent. It's a relative index.
Division, communication, and local computation.
None contains anything of the others.
Local op and communication and calculation.
Frequency amplitude etc, forget it.
A 2nd-order PDC should not be a 2nd-order PDC. That's
over-complicating, like 1+1−1+1. So it's not 2nd order.
The rest is a combination of the two: a composition of communication
plus local operation over the communicated data.
The adjustments are performed in the division or combination
phases: in the dividing phase it's post-adjust; in the combining phase
it's pre-adjust (I think I transcribed this backwards).
F x = y: what's done in the domain (x) is pre; if done in the y
domain then it's post.
In FFT, in the time domain it's pre; in the frequency domain it's post.
Similarly counting ones: ones of what? We don't care.
Even in applied math you might skip the units; even with Newton's F = ma, I don't need to know the force of what.
Odd-even is a premorphism. LR is a postmorphism. "Tom, you are not alone. Very smart people don't get this."
To build any PDC: for the division just choose 1 of the 2; the
communication is correspondent, i.e. butterfly communication; the
local op is coefficient times the other value, plus self.
“If you take more than one minute to write a PDC down you know you must be wrong. “
Let's consider Newton's F = ma. You might approach it with a lot of
experimental measurements, etc.; then maybe you will believe the
math. You can say the experimental work is admirable, that's okay, but
it is unnecessary.
This decomposition is so powerful.
In terms of programming FFT we have LR, CORR, MA: that's it.
Don't write a Divacon interpreter; it should be self-persuasive. To
run it for applications, okay, feel free to do it. But spend one minute
to write down the three things. Write down a PDC for FFT, and send it
to me.
Communication should feed calculation.
Do some examples in the thesis and really understand them, then come
back to this.
2nd-order PDC, a.k.a. how to make a fast algorithm slower. Don't
figure it out; just forget about that.
"A real story: at Boeing I was a pretty high-up manager. But, atypical
among managers, I did my own programming. My group said they completed
something. Usually managers wouldn't look. I looked and said, you guys
are very creative; they looked happy. This is really original work:
bigger smiles. You are very talented in the way you present this
problem to me, to work so hard to make it complicated; even in 100 years I
couldn't do something so ridiculous and slow. This requires a lot of
originality. I was so harsh."
It's like in physics: learning F = ma, doing experiments, and seeing
if the results say so; then later we throw out all the quantities and
units. PDC says throw out all the instruments and units.
One plus one equals two is a big thing to understand, many have a hard time.
Most people never graduated first grade.
Okay, I'll just try to understand something, maybe not even separately
from understanding how to compute it. Ignore the meaning of the parts
in PDC. 1+1 = ?? Okay, two is the right answer. The calculator didn't
ask what things you are counting. PDC gives you the answer, boom,
done. The result is right.
So just go to the PDC directly, write it down, and do it right.
Do some simple examples in the thesis, simple and comfortable ones. Develop abstraction and throw out all unnecessary details.
George 7/5/24
I start explaining an approach I'm trying to take to understand and do
one of his ideas. Finally he interrupts, saying: Tom, you're going in
the wrong direction. There's no point in explaining a wrong solution.
This is the way to talk about Fibonacci numbers. By definition, a
number that depends on some previous numbers, in this case two
previous numbers, with some dependency.
You can expand it to put it in matrix form: the two numbers
before it and the number itself. F(n−1) in turn depends on F(n−2) and
F(n−3).
The trick is to shift the dependency from the two previous ones.
This is captured by a 2x2 matrix: each time you shift the dependency it
changes, and the way it changes is matrix multiplication, until it
reaches all the way back to F1, F2.
For each group of 2 you align the dependency so both rely on the same two things.
Then for a group of 4 you align them so they rely on the same four
previous Fibonacci numbers.
Then align 8 and make them all align on the same thing.
Then you can shift the dependency with broadcast within the current
division group. The data representation never grows: two 2x2 matrices.
Finally the dependency is moved to the first group in the entire
sequence, and in that case you have the final solution. Shift 2x2
matrices, no larger matrices: whenever the data to be communicated
grows, you know you are wrong.
By hand, do a 16-point FFT.
#2: do scan, by hand and pencil.
Then do two more, scan-related.
Another is: do a scan, a premorphism scan. Do it and run it.
Another homework: this scan uses broadcast. On real machines it
means the data needs to go out, a fan-out of 32x for
32 receiving processors. This imposes some sequential time cost.
Can you transform the scan into an algorithm using corr as opposed to
broadcast?
So run it with the Divacon interpreter in your head, using pencil
and paper.
Make it work; then you know it's right.
T: we are tracking the types? G: It’s exactly that.
You should remember divacon is following the functional programming approach.
The basic element is the function, not numbers.
A function always has a domain and a codomain and the mapping between
them. That's why, in general, in any divide-and-conquer algorithm, the
division is in the source domain of the input, and the combination is
in the co-domain of the output results.
The + involves both communication and local operation.
It doesn't have to be bidirectional like the butterfly FFT.
+ : # comm
For any DC, this is so.
Perfect shuffle, inverse, reverse, etc.
And the same for stupid MapReduce, which is all Google understands;
beyond it they understand nothing.
Another day:
Don't just read, understand each sentence, and combine from
beginning to end in a flow; that's a start, but then you cannot
insult the work by not doing the exercises. For each day of reading,
time spent reading vs. exercises should be 1:20. If a person asks, have you
learned calculus: I learned calculus when I was sent to the
countryside, in very harsh conditions, with my schoolmates. Not only one
textbook, but one as the main text and another with 3000 problems or more, so I
spent 1.5 years to complete it and built myself really solid knowledge
and the ability to solve problems. Some schoolmates actually commented on
me: George, people are really different? Yes? Yes. One other
classmate started learning calculus and finished in three days. I
said, okay, all right. Of course, eventually the Cultural Revolution
ended and college reopened. In 1/3 Shaanxi province I was number 1;
that guy failed it. In 2 months I took the entrance exam to grad
school; there they really test the calculus problems, and I was competing with
students from before the Cultural Revolution. Score #1 was 61, #2 was 60,
mine was 59 (#3), and #4 was 39.
I'm just saying doing exercises is very important. It's not a novel.
In three days he turned over all the pages. People are doing that
and calling it good. But that's wrong, and so he never went to
college as a result.
Are you doing an open-source project? Some day, yes. I don't feel very
good. Looking forward to the CAR-T procedure. Hard for a couple of
months, but after that a new world; I imagine I can basically go back
to my old state. I used to go to the gym every day of the year, 2-3 hours
every day, and in winter a lot of skiing, in summer a lot of
rollerblading. I don't think I can completely go back to that, but at
least I can work on the research.
Going back to the proof that DST can search k-dimensional
space in constant time: I have a proof, but simpler now, which will be
shocking to the community. The previous version is not published.
Your question showed a lot of confusion about a lot of things, Tom.
A network has a topology: it can be a 2D mesh, n x n, or 3D, n x n x n, or 4D or
5D; a log-dimensional mesh is called a hypercube. Any mesh has a
diameter, referring to the shortest path between the remotest processors.
Use 2D as the easiest to hold in your head: for 2D the diameter is corner
to corner, (N−1) + (N−1). You cannot make it log N away.
A 3D cube's diameter is (n−1) x 3. In a hypercube the two furthest are log N apart.
So to pass a message to someone who is not your direct
neighbor: like electric poles in the countryside, just pass the power
to the next one.
Higher dimension, smaller diameter.
Nvidia is so stupid it is unbelievable, and so successful: the #2 biggest
enterprise in the world, with such a stupid idea. It is beyond
imagination. I can ask: can you invent a more stupid communication
model? No, I cannot think of one. I challenge you to find a stupider
model; I believe you cannot invent one. Well, you could say,
let's send it to everyone 1000 times, or send randomly to anyone, but then
you would also have to be crazy.
A paper I wrote that you can read is called "Communication-optimal
algorithm of k-d FFT on m-d mesh, for arbitrary k and m." Best paper
presented at an international conference in Munich, Germany; I got best
speaker of the conference; of course no one understood it. Search
for "Z. George Mou".
7/6/2024 phone call notes.
The key to the Fibonacci parallelization is CI, computational
intensity: the ratio of operation count to data element count.
E.g., matrix multiply has data of 2 NxN matrices and N^3 operations, so
CI = O(N).
Google MapReduce has O(1) intensity.
You want a small block of data and a lot of work. No one seems to know
this principle.
TV: are these things in anyone's curriculum?
GM: they are in my curriculum. But the general statement can be made
as follows: anything good is not in the current curriculum. Particularly when it
comes to parallel computing, almost everything in the curriculum is bad
and wrong, as wrong as could possibly be.
TV: ArrayFib: for n Fibonacci numbers, the task of computing all of
them in log n time means it needs to be done in parallel. Any one of
them can be computed in log n by itself.
GM: Fib is a special case of order-k difference equations, with order k=2.
It is also a special case of a triangular linear system: below the
diagonal the entries are all constants, above they are all zero.
New publications in the field are all stupid. Unbelievably.
Consider for example: you know automata can be considered as
transducers, right? But if I had inputs and outputs of 2^20 elements,
did you know, the same as Fib, you can do it in log n? Yes, for an
arbitrary finite-state automaton.
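The reason this works, as with Fib, is that the per-symbol transition functions compose associatively, so a balanced reduction can evaluate them in O(log n) depth. A sketch in plain Python (my own code, with a made-up parity automaton as the example; states are assumed to be 0..k−1):

```python
from functools import reduce

def run_fsa(delta, states, start, inputs):
    """Run an automaton by composing transition functions: each symbol c
    becomes the table s -> delta[s][c]; composition of tables is
    associative, so a balanced tree can do it in O(log n) depth.
    Here reduce stands in for that (conceptually parallel) tree."""
    steps = [tuple(delta[s][c] for s in states) for c in inputs]
    compose = lambda f, g: tuple(g[f[s]] for s in states)  # f then g
    whole = reduce(compose, steps, tuple(states))          # identity start
    return whole[start]

# Example: parity automaton, state flips on symbol 1.
delta = {0: {0: 0, 1: 1}, 1: {0: 1, 1: 0}}
run_fsa(delta, (0, 1), 0, [1, 1, 0, 1])  # 1 (odd number of 1s)
```

Note the data communicated at each combine is one table of size |states|, independent of input length, the same never-grows property as the 2x2 Fibonacci matrices.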
GM: I have a principle: anything that looks sequential is actually
parallel. Neighbor dependency seems to prove sequentiality, but by GM's
principle it must be highly parallel.
The arguments always go: by induction, this process operates one by one, then …
TV: that’s nice bait you dangle before me. I am the fish you are the fisherman.
GM: somewhere in the thesis it says:
any postmorphism with correspondent communication and dlr (LR division) is
equal to a premorphism with deo and corr.
Just draw the picture, then you will see and know. Assume the local
operations are the same. It is too obvious, it is trivial. And that's why
there are two FFT algorithms. But they publish them separately,
because they don't have the vocabulary. It's so simple.
TV: So ok that’s already a big deal, but what if not corr comm?
GM: The three basic communications are: corr, broad, mirr. Correspondent, broadcast, and mirror.
Broad can be transformed to corr. Think about it.
Mirr: can you turn mirr into corr? TV: reverse first? GM: this is a
good question, but it needs to be worked out. To do that, first
show the transformation from broad to corr. So for clues to turn one
into another, look at that case, then work on mirr.
Backus, in "Can Programming Be Liberated from the von Neumann
Style?", said programs should be syntactically transformable like
mathematical expressions in calculus and algebra (e.g. associativity…).
In DC there are many such equations, e.g. this one: dlr + post + corr = deo + pre
+ corr. Lots of these exist. T: that's amazing. GM: looking at
current papers, they are not ashamed, but they give stupid solutions.
TV: what is a recipe for better papers? Maybe: do they separate the
three dimensions? GM: you can't use Divacon without separating
them. So just say it in Divacon. Just like in chemistry and physics, the
three things, proton, neutron, electron: you have to decompose to those
to understand anything. If you don't do that you have no idea. It is:
come to Divacon or fail.
TV: like Zellig Harris and henchmen at Penn thesis defenses, you will
say: how does this fit into MY theory? GM: maybe in linguistics, but
here, the fact that the universe of computation, this whole world,
consists of the three components (TV: dimensions?) is a fact.
Anything in conflict with this view goes nowhere. You don't have to be
polite. Same with DC. In fact this can be said: in the entirety of
computation there is nothing else but these three things. For any
computation, whatever it is made of is, all and only,
these three things. So for example, iteration is also divide and
conquer, indeed a stupid one, but still DC.
For example, a For loop is nothing but head-tail division. It is
DC. Ask generally: can you turn head-tail into balanced? Yes, under
certain conditions this can automatically be done. TV: what
conditions? GM: I should be careful to be fully general in the
statement, but in most cases it is trivial.
Both: Everything is either trivial or false.
So you see: pre to post, broad to corr, HT to LR and sometimes to
EO; you can go from one place to others. There are a lot of choices, and
so, as Backus said, a program should have that property.
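The HT-to-LR transformation can be made concrete in a small sketch (my own code): a For-loop reduction is head-tail division; when the operator is associative, the condition alluded to above, it equals the balanced LR version, which has O(log n) depth.

```python
import operator

def reduce_ht(op, v):
    """Head-tail division: a For loop in disguise. O(n) depth."""
    acc = v[0]
    for x in v[1:]:
        acc = op(acc, x)
    return acc

def reduce_lr(op, v):
    """Balanced left-right division: O(log n) depth. Equal to
    reduce_ht whenever op is associative."""
    if len(v) == 1:
        return v[0]
    half = len(v) // 2
    return op(reduce_lr(op, v[:half]), reduce_lr(op, v[half:]))

reduce_ht(operator.add, [1, 2, 3, 4, 5])  # 15
reduce_lr(operator.add, [1, 2, 3, 4, 5])  # 15
```

Picking among the equivalent forms for speed is exactly the circuit-design analogy that follows.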
GM: it's like in circuit design: one Boolean expression can be
realized by many circuits, and you can pick the one that fits in terms
of speed, gate count, etc. In terms of parallelism it's the same.
The world does not know this.
GM: I owe the world to talk in such a way that people do not give up
in the middle, like my Yale professors who said, we don't understand
why, even though we think you are right. I did get best dissertation
of the year at Yale, and a nomination to the ACM, but not even one
person really understood.
TV: Am I the only one to come sip at the fountain of George Mou?
GM: I had students, but it is sort of hopeless. Brandeis, Hopkins,
Xiamen University: if I gave something to them they would graduate, but
if not, they were going nowhere.
T: To me it’s like the need for homework. To become a pickle one has
to marinate like a cucumber in the salt or the vinegar of the problems
and expressions.
T: What village did they send you to when they sent you in the
cultural revolution? Chang hsing tsun village.
GM: I listen to Tchaikovsky and think, how could he have done that?!
NOBODY could have done that. I can only think, he must have gone to
another place to bring this to us. Like going there and God gives us
something to take back.
As an example: DST came from a challenge. With my teammate we said,
split the problem: George, assume the 3D space-search problem is done,
and you do your part. So GM did his part; after two weeks, great
results. But by then the other fellow had nothing. What have you
done? He had looked at hexagons. GM said, okay, that's beautiful, but what
does it give you? No answer. Does it have to be so complicated? Can
it be simpler? "That night I watched TV till 11; then by 3am it was
running, plus I spent part of the intervening time writing lots of
theorems. My friend said, if you can do this I'll give you my one
month of salary. He didn't give me the salary, of course. Even now,
after all this time, I would not change the algorithm and results at
all. I look back and think: how could I possibly have obtained all
those insights in a few hours? Even now I have difficulty
understanding it. My explanation is that someone took me to heaven to
give me something and threw me back down to earth. Later I did spend
several months teaching this to others, and even then people did not
understand, and those were PhD students. Not even close."
TV: George, where did you get your spare and mathematical writing
style? GM: I read a lot of books, and of course they influenced me.
At Yale, a little book called The Elements of Style (Strunk and White);
of course I read it carefully. I also believe in this: if you think
clearly then you can write clearly.
TV: the k-dim FFT on m-mesh paper is behind a paywall. GM: keep
looking; from some places you can download it directly. That article
is probably completely hopeless: the way I wrote it, no one, not even
one person, could ever understand it. But give it a try; it's really
good. If the entire field could understand it they could save a lot
of computational efficiency, even these many years later. Thinking
about that is very depressing.
TV: that’s why I’m here, George, to give you hope.
Ok let’s stop for now. Time for dinner.
—-
7/8
Decompose pbits with pbits.max into bit substrings that know their levels a, finally single bits 1<