Ordered by date of original research or of publication.
English version translation © 2001 M.E. Sharpe, Inc., from the
Chinese original, in Contemporary Chinese Thought, vol. 32,
no. 4, Summer 2001, pp. 17–36. Cached copy here.
"The author of this essay, Yu Luoke, was a young worker and
thinker in Beijing. Because of this writing, he was executed by
the government in 1970."
Cached copy here
In this standard nose-on-floor view, you know the current state of the
machine, supply an input, and you get an output and a transition
to a new state.
For example, the same input may produce different outputs if it is
provided in two different states.
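As a minimal sketch of this nose-on-floor view, here is a hypothetical two-state transducer (my own illustrative example, not from any paper): the state tracks the parity of 1s seen so far, and the output is the input bit XORed with the state, so the same input symbol yields different outputs in different states.

```python
# Nose-on-floor view of a finite-state transducer: given the current
# state and one input symbol, look up the output symbol and successor
# state.  Hypothetical example: state = parity of 1s seen so far,
# output = input bit XOR state.
DELTA = {
    # (state, input) -> (output, next_state)
    (0, '0'): ('0', 0),
    (0, '1'): ('1', 1),
    (1, '0'): ('1', 1),
    (1, '1'): ('0', 0),
}

def step(state, symbol):
    return DELTA[(state, symbol)]

# Same input, different state, different output:
# step(0, '0') gives ('0', 0) while step(1, '0') gives ('1', 1).
```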
This paper is about doing this in parallel.
Suppose you have an FSA with \(10^6\) or \(10^9\) input symbols. The intuition is that you must feed it one symbol at a time: at each time step it receives one symbol, in whatever state it is in, looks at the table, and produces one output and a successor state.
But is it possible to do this in less time than one step per symbol? This paper claims you can do it in logarithmic time: since \(10^{6}\approx 2^{20}\) and \(10^{9}\approx 2^{30}\), in 20 or 30 steps you can get all the million or billion (if 1-1) output symbols.
However, you could rotate the perspective (Tom: or invert it, or lift your nose off the floor), and consider that an input symbol is itself the function: it applies to one state and returns a pair, the output and the successor state.
If this global-perspective FSA function has the right nature
... then: you could compose these functions. As it turns out,
function composition scales! Can you compose \(f_{1-2} = f_2 \cdot f_1\)
in unit time?
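Under this view a symbol's function can be represented as a table from predecessor state to (output, successor state), and composing two such tables costs O(number of states), which is constant with respect to input length, i.e. "unit time" per composition. A minimal sketch, with illustrative tables of my own invention:

```python
# Each input symbol, nose off the floor, is a function from a
# predecessor state to (output string, successor state).  Composing
# two such functions costs O(|states|), independent of input length.
def compose(f1, f2):
    """Return the composition 'first f1, then f2': run f1 from each
    state, then f2 from f1's successor state, concatenating outputs."""
    g = {}
    for state, (out1, mid) in f1.items():
        if mid in f2:                  # missing cell = symbol rejected there
            out2, succ = f2[mid]
            g[state] = (out1 + out2, succ)
    return g

# Hypothetical symbol-functions over states {0, 1}:
f_a = {0: ('x', 1), 1: ('y', 0)}
f_b = {0: ('p', 0), 1: ('q', 1)}
# compose(f_a, f_b) maps state 0 to ('xq', 1) and state 1 to ('yp', 0).
```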
Apply this in a Divacon Left-Right Postadjust computation graph and, boom, done: you have a doubled and doubled and doubled (as deep as you want to go) composition of compositions of compositions of InputSymbolic FST *functions*. Finally, at the top of the postadjust tree, the last function maps from the first input state to the final output state, consuming the entire input sequence and outputting the whole output symbol sequence.
Did I slip one past you? Actually I didn't. That's the whole thing.
But if you think back on it and now you don't think you got it, please
read it twice more, make informal drawings of it three or four times,
then you may want to learn about Divacon, post-adjustment and the Left-Right
divide/combine functions, and visualize how the computation graph
might look.
Finally, figure out what these objects are, the functions
that are composed. Are they lookup tables? In the nose-to-floor view,
yes: one row for each state, one column per input symbol; a cell is
empty if that input symbol is not accepted in that state, and when not
empty the cell contains the successor state and output symbol. It is
the nose-to-floor view because you can only see what happens to one
symbol at a time. The inverted view is some slightly different
something or other. What is it? Each input symbol has its own InSym
table, which comprises the function that we consider the symbol to be.
Really it's just a single row: the whole row is for that one input
symbol, not for each state, and there is one column for each
current (predecessor) state, so that the cells of the row combine the
output symbol and the successor state which result from that input
being accepted in that predecessor state.
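The rotation between the two views can be sketched directly; the table contents here are hypothetical, chosen only to show the shape of each view:

```python
# Nose-to-floor view: one row per state, one column per input symbol.
# (A missing entry is an empty cell: symbol not accepted in that state.)
BY_STATE = {
    0: {'a': ('x', 1), 'b': ('y', 0)},
    1: {'a': ('z', 0)},                 # 'b' not accepted in state 1
}

def by_symbol(by_state):
    """Rotate to the inverted view: one table (a single row) per input
    symbol, with one cell per predecessor state."""
    tables = {}
    for state, row in by_state.items():
        for sym, cell in row.items():
            tables.setdefault(sym, {})[state] = cell
    return tables

# by_symbol(BY_STATE)['a'] is {0: ('x', 1), 1: ('z', 0)}: the whole
# row belongs to symbol 'a', indexed by predecessor state.
```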
What's the difference? Just rotate the table, rows for columns and
columns for rows, to get this inverted view. But now you get the joy
of composition: now the computation takes functions to
composed/combined functions, not functions to single steps along the
path, to no more than just the next single state. Here the operation
takes two whole functions and makes a whole new function, the
composition of the input functions. Instead of taking one symbol to
one state and one output symbol, it takes two input symbols to one
output state along with an output symbol sequence.
Each (predecessor) cell, which simply has the result stored in it,
namely the pair or tuple [outSym, succState] = InSym(pred), can be
composed with any successor cell: its row given by the next symbol in
the FSA's input string, and its column, which encodes the from-state,
selected according to succState above. Just concatenate the outputs
and sequence the states accordingly. The output of the composition
goes in a row for the input symbol PAIR, in the cell corresponding to
the from-state; into that cell go an output symbol *sequence* and the
successor state of the successor cell. Of course we might have to do
that for all the input symbols and matching successor cells, but only
if we split the input symbol sequence into onesies, then combine them
into the composed functions of length-2 inputs... .
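Putting the pieces together, here is a sequential simulation of the doubling tree (my own sketch, not Divacon notation): lift each input symbol to its state-to-(output, successor) table, then compose adjacent tables pairwise, doubling the span at each level. Each level's compositions are independent, hence parallelizable, and there are log2(N) levels. For simplicity this sketch assumes a power-of-two number of input symbols.

```python
# Lift each symbol to a function-table, then compose adjacent tables
# pairwise, level by level: log2(N) levels for N symbols, and every
# composition within a level could run in parallel.
def compose(f1, f2):
    """First f1, then f2: concatenate outputs, chain successor states."""
    return {s: (o1 + f2[m][0], f2[m][1])
            for s, (o1, m) in f1.items() if m in f2}

def run(tables, symbols, start):
    fs = [tables[sym] for sym in symbols]        # leaves of the tree
    while len(fs) > 1:                           # one tree level per pass
        fs = [compose(fs[i], fs[i + 1]) for i in range(0, len(fs), 2)]
    return fs[0][start]        # (whole output sequence, final state)

# Hypothetical per-symbol tables over states {0, 1}:
TABLES = {'a': {0: ('x', 1), 1: ('z', 0)},
          'b': {0: ('y', 0)}}
# run(TABLES, 'aaaa', 0) gives ('xzxz', 0).
```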
You can take it from here.
(For review on lattice gas automata see
The Classical Lattice-Gas Method by Jeffrey Yepez 1999,
DTIC/ADA455835.pdf).
Abstract: Distributed Objects (DO) as defined by OMG’s
CORBA architecture provide a model for object-oriented
parallel distributed computing. The parallelism in this model
however is limited in that the distribution refers to the
mappings of different objects to different hosts, and not to
the distribution of any individual object. We propose in this
paper an alternative model called Individually Distributed
Object (IDO) which allows a single large object to be
distributed over a network, thus providing a high level
interface for the exploitation of parallelism inside the
computation of each object which was left out of the
distributed objects model. Moreover, we propose a set of
functionally orthogonal operations for the objects which allow
the objects to be recursively divided, combined, and
communicate over recursively divided address
space. Programming by divide-and-conquer is therefore
effectively supported under this framework. The Recursive
Individually Distributed Object (RIDO) has been adopted as the
primary parallel programming model in the Brokered Objects for
Ragged-network Gigaflops (BORG) project at the Applied Physics
Laboratory of Johns Hopkins University, and applied to
large-scale real-world problems.
Although this paper by Google is famous and considered the
pinnacle of parallel processing, the principle that any solution
coming from a company like Google must be unacceptable suggests
it may have flaws. ("How do you know a solution is bad? If it's
from Google, that's enough.") Reading between the lines, I notice
that the paper fails even to express the basic mathematical facts
of MapReduce, namely: (1) it is a higher-order function (a
function that takes functions as input and returns a function as
output, which may then be applied to data); (2) this particular
higher-order function specifically requires a unary function
(call it map) and (3) an associative binary function (call it
reduce); (4) the returned function applies the unary map
function to each element of the input data vector, and the
associative binary reduce function to the vector of map
outputs; and (5), obviously yet perhaps not obviously enough for
these authors and their employer, an associative operation
summarizing a vector of data can be applied using recursive
even-odd division, or left-right combination after division, and
therefore completed in log N time. For example, in 3 time steps +
can be applied to 8 elements as ((a+b)+(c+d))+((e+f)+(g+h)),
rather than using 7 time steps to equivalently calculate
(((((((a+b)+c)+d)+e)+f)+g)+h). Use recursive parallel
decomposition!
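The 3-steps-versus-7 point can be made concrete with a small sketch (names and structure are mine, for illustration): pair adjacent elements and halve the vector on each pass, counting each pass as one parallel time step.

```python
# Associative reduce in log2(N) parallel steps: each pass combines
# adjacent pairs, so the vector halves each time.  For 8 elements
# that is 3 passes, computing ((a+b)+(c+d))+((e+f)+(g+h)), versus 7
# sequential steps for (((((((a+b)+c)+d)+e)+f)+g)+h).
# Assumes a power-of-two length for simplicity.
def tree_reduce(op, xs):
    steps = 0
    while len(xs) > 1:
        xs = [op(xs[i], xs[i + 1]) for i in range(0, len(xs), 2)]
        steps += 1            # one parallel time step per pass
    return xs[0], steps

# tree_reduce(lambda a, b: a + b, [1,2,3,4,5,6,7,8]) returns (36, 3):
# the same sum as 7 sequential additions, in 3 parallel steps.
```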
To reiterate, although some parallelism is achieved it fails to
take advantage of the deep parallelism of a recursive
divide-and-conquer approach, so that log N time is not achieved.
On the one hand, after a single M-ary division of the data, a
number of linear-time subtasks are scheduled. In the case of the map
operation, which is non-combinatory, if all the available
processors are busy doing essential map work, that's all the
parallelism available: all must be done independently, and no
ordering or building up of results saves operations. On the
other hand, an associative reduce operation applied to the whole
can be done in either head-tail (as they do) or left-right (as
would be better) fashion. Let's count. Each of those could be
done on 1 or M subdivisions of the map outputs, yielding O(N) vs
O(log N) time on 1 subdivision, or O(N/M) vs O(log(N/M))
time on M subdivisions, ignoring the final
combination/reduction. With N ~ 2^33 and M = 1, N vs log N is
2^33 vs 33; whereas, with N ~ 2^33 and M ~ 2^10, we still find
N/M ~ 2^23, while log(N/M) ~ 23. That is, Dean's splits indeed
save 3 decimal orders of magnitude, but nothing like the
log(2^23/23) ~ 23 - 4.5 ~ 18.5 binary orders of magnitude saved
by recursive left-right division.
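The arithmetic above (all logs base 2) can be checked mechanically:

```python
# Checking the counts: N ~ 2^33 items, M ~ 2^10 splits.  Each split
# still performs N/M ~ 2^23 sequential reduce steps, where a
# left-right tree would need only log2(N/M) ~ 23.
from math import log2

N, M = 2 ** 33, 2 ** 10
assert N // M == 2 ** 23          # work remaining per split
assert log2(N // M) == 23.0       # tree depth for that split
# Advantage of the tree over the sequential split, in binary orders
# of magnitude: log2((N/M) / log2(N/M)) = 23 - log2(23) ~ 18.5
saving = log2((N / M) / log2(N / M))
```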
The point is that an associative reduce function applied to a
large dataset using left-right decomposition enables log N time;
an absence of clear thinking will fail to notice this, however
obvious it may be. This is made explicit in a Divacon definition:
Here an associative binary operation, reduce, is given as the
combination function, which combines outputs while moving back up
the computation tree after map has separately processed each
input element at the computation tree's leaf nodes. A
left-right-subdivided computation tree on 10^7 leaves is only 23
levels deep, reducing the task to O(log N) = 23 time steps,
vastly fewer than the vaunted O(10^7) time steps.
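The Divacon definition just described can be sketched as follows (in Python rather than Divacon notation, and with names of my own choosing): divide left-right down to single leaves, apply map there, then combine with the associative reduce on the way back up, giving a tree of depth ceil(log2(N)).

```python
# Divide-and-conquer skeleton: left-right division to single-element
# leaves, map applied at each leaf, associative reduce as the combine
# function on the way back up.  The two recursive calls at each node
# are independent, so tree depth = parallel time = ceil(log2(N)).
def divacon(map_f, reduce_f, xs):
    if len(xs) == 1:
        return map_f(xs[0])                       # leaf: map one element
    mid = len(xs) // 2
    left = divacon(map_f, reduce_f, xs[:mid])     # these two calls could
    right = divacon(map_f, reduce_f, xs[mid:])    # run in parallel
    return reduce_f(left, right)                  # combine going back up

# divacon(square, add, [1, 2, 3, 4]) computes (1+4) + (9+16) = 30.
```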
Hence even dividing once over processors fails to achieve
anywhere close to such improvements, despite admittedly
achieving, say, 1000x improvements. Recursive parallel
decomposition is the trick, which the Google paper misses.
Next, parallel communication opportunities are ignored (cf.
Optimal mappings, here). Whereas tree-based communication would
let that subset of communications which can be sent and delivered
within sub-branches proceed simultaneously, the basic Ethernet
link design used in Google's pride here implies queued,
sequential communication rather than simultaneous, parallel
communication, at least within the set of processors, or sets of
sets of processors, so linked. And when communications are not
perfectly local, the queueing/blocking bottlenecks at the top of
the tree not only remain but are multiplied combinatorially.
Therefore queueing and blocking are implicated, and transmission
characteristics are vastly narrowed. See also the 1995 Min-Cut
paper, above; indeed, do a Min-Cut transform on Google's network
and you will see the relative limitations.
My notes here.
However the field is unaware of this amazing capability and may
remain so indefinitely. Indeed there are many beautiful
properties of DST, but it seems impossible to explain them to
anyone so that they truly understand them all.
TV note: asking DeepSeek AI about this in general terms after
George mentioned it to me, the AI found this paper with this
title given at this conference, which does appear to be real; but
it hallucinated the actual publication, which is not actually in
those proceedings. And George, for his part, says he has a
presentation of it, but does not himself have a full paper.
Notes on the DST patent here.
George wants to communicate the idea, in comparing KD Tree vs DST, that
DST search time is related (proportional) to the size of the searched
region: if small, then small. Whereas in a KD tree, search over a
small region is related to the (total) tree size, not the searched
region's size; so 27M points in the search universe will make a
KD tree slow, but not DST, which just zooms right in and ignores the
rest. That's a huge difference.