Every solution is either trivial or false.
-- Zhijing George Mou
Ordered by date of original research or of publication.
C. F. Gauss, "Nachlass: Theoria Interpolationis Methodo Nova
Tractata," in Carl Friedrich Gauss Werke, Band 3, Königlichen
Gesellschaft der Wissenschaften: Göttingen, pp. 265-330, 1866.
H. Burkhardt, "Trigonometrische Interpolation," Chapter 9 in
Encyklopädie der Mathematischen Wissenschaften, vol. 2, part 1,
1st half, 1899-1916.
H. Burkhardt, "Trigonometrische Reihen und Integrale (bis etwa
1850)," Chapter 12 in Encyklopädie der Mathematischen
Wissenschaften, vol. 2, part 1, 2nd half, 1904-1916.
J. W. Cooley & J. W. Tukey, "An Algorithm for the Machine
Calculation of Complex Fourier Series," Math. Comput., vol. 19,
no. 2, pp. 297-301, Apr. 1965. Reprinted in Digital Signal
Processing, ed. Larry R. Rabiner & C. M. Rader, pp. 223-227, New
York: IEEE Press, 1972.
Alan Perlis, "The Synthesis of Algorithmic Systems," the 1st Turing
Lecture. (Journal of the ACM, vol. 14, no. 1, 1967.) (cached copy here)
Yu Luoke. "Chu shen lun" (On Family Background). Beijing,
1966-1967. In Journal of Middle Schools Cultural Revolution,
edited/published by Zhijing Mou and the Family Background Study
Group of Beijing.
English version translation © 2001 M.E. Sharpe, Inc., from the
Chinese original, in Contemporary Chinese Thought, vol. 32,
no. 4, Summer 2001, pp. 17–36. cached copy
here.
TLDR:
This essay argues that all people are created equal.
M. E. Sharpe: "This essay was originally published on January
18, 1967 in the premiere issue of the capital’s Journal of Middle
Schools Cultural Revolution. Its first draft was completed in
August 1966, and it was subsequently revised in November 1966.
"The author of this essay, Yu Luoke, was a young worker and
thinker in Beijing. Because of this writing, he was executed by
the government in 1970."
Donald Fraser. Array permutation by index-digit
permutation. Journal of the ACM, 23(2):298-309, April 1976.
John Backus, "Can Programming Be Liberated from the von Neumann
Style? A Functional Style and Its Algebra of Programs," the 1977
Turing Lecture. IBM Research Laboratory, San Jose, CA.
Michael T. Heideman, Don H. Johnson, C. Sidney Burrus, "Gauss and
the History of the Fast Fourier Transform", 1985. Rice U
Eng. Dept.
cached copy here.
Z. G. Mou and P. Hudak. An algebraic model for divide-and-conquer
algorithms and its parallelism. The Journal of Supercomputing,
2(3):257-278, November 1988.
cached copy.
Selim G. Akl, The Design and Analysis of Parallel Algorithms.
Prentice Hall textbook, 1989.
cached copy here.
This was the state of the art before George Mou.
Zhijing George Mou, A Formal Model for Divide-and-Conquer and Its
Parallel Realization, May 1990, Research Report
YALEU/DCS/RR-795. (Yale
PhD Thesis) (cached copy here.)
Z. G. Mou. Divacon: A parallel language for scientific computing
based on divide-and-conquer. In Proceedings of the Third
Symposium on the Frontiers of Massively Parallel Computation,
pages 451-461. IEEE, October 1990. Link.
Abstract:
An overview of the language, covering Divacon primitives and
simple programming constructs that are referred to as
functional forms, is given. Two divide-and-conquer programming
constructs are discussed. Divacon style programming is
demonstrated for a number of scientific applications. Some
interesting equivalences and transformations between Divacon
programs are examined. Implementation and performance are
briefly considered.
Z. G. Mou and M. Goodman, "A Comparison of Communication Costs
for Three Parallel Programming Paradigms on Hypercube and Mesh
Architectures". Proc. 5th SIAM Conf. on Parallel Processing,
pp. 491-500, held in Houston, TX, March 1991.
ACM
link here.
B. Carpentieri, Z. G. Mou, Compile-time transformations and
optimization of parallel Divide-and-Conquer algorithms. Published
1 October 1991 in ACM SIGPLAN Notices. Link.
Abstract:
The goal is to make Divide-and-Conquer algorithms suitable
for implementation on hypercube-like parallel architectures,
such as the Connection Machine, even if the original
algorithm implies a division function that is not the
left-right division and/or communication pattern that cannot
be implemented by direct-neighbor communication.
Marc Goodman, Z. G. Mou, DC Transpose: a method for reducing
communication in divide-and-conquer algorithms on mesh-based
computers. Published in Proceedings of the Third IEEE Symposium
on Parallel and Distributed Processing 2 December 1991.
Abstract:
The paper introduces the notion of a DC Transpose, which is a
remapping of data on a mesh, using a global routing mechanism
such as the global router on the MasPar MP-1, to solve
divide-and-conquer algorithms.
Xiaojing Wang, Z. G. Mou. "A divide-and-conquer method of
solving tridiagonal systems on hypercube massively parallel
computers" Proceedings of the Third IEEE Symposium on Parallel
and Distributed Processing. 2 December 1991. Link here.
Abstract: The authors present a new parallel algorithm,
based on the divide-and-conquer (DC) strategy, for solving
tridiagonal systems, and show that for the binary hypercube
architecture the communication complexity of their DC method
is the lowest of all known methods, making it the most
efficient tridiagonal solver for communication-expensive
massively parallel hypercube computers.
Z. G. Mou, C. Constantinescu, and T. Hickey. Divide-and-conquer
on a 3-dimensional mesh. In Proceedings of the European Workshops
on Parallel Computing, pages 344-355, Barcelona, Spain, March
1992.
Z. G. Mou, Cornel Constantinescu, and T. Hickey. Optimal mappings
of divide-and-conquer algorithms to mesh connected parallel
architectures. In Proceedings of International Computer
Symposium, pages 273-284, Taiwan, December 1992.
Z. G. Mou, Xiaojing Wang, T. Hickey. Divide-and-Conquer
Algorithms with Recursive Broadcast Communication on
Reconfigurable Arbitrary Dimensional Mesh. Published in SIAM
Conference on Parallel Processing for Scientific Computing. 1993.
Z. George Mou and Xiaojing Wang, Optimal Mappings of m
Dimensional FFT Communication to k Dimensional Mesh for Arbitrary
m and k. pp 104-119. in PARLE '93, Parallel Architectures and
Languages Europe, Lecture Notes in Computer Science #694,
Springer-Verlag. 5th International PARLE Conference, Munich,
Germany, June 1993 Proceedings.
cached copy here.
Context and story here.
Z. G. Mou. A Divide-and-Conquer Parallel Algorithm for Banded
Linear Systems. PP, 1995.
Jayadev Misra, Powerlist: A Structure for Parallel Recursion. ACM
Transactions on Programming Languages and Systems, Vol. 16, No. 6,
November 1994, pages 1737-1767. Cached copy here.
Z. G. Mou. "Spectrum analysis and min-cut transformation
of communication networks in parallel computers" Published in:
Proceedings Frontiers '95. The Fifth Symposium on the Frontiers
of Massively Parallel Computation (IEEE) Conference in McLean,
VA, USA on 6-9 February 1995. Link here.
(cached copy here)
Abstract: We present a formal model for the
analysis of communication networks in parallel computers. Unlike
most others, our model focuses on the transmission delays as
opposed to the propagation delays of communication patterns. The
model allows all symmetric communication networks to be examined
by their spectrums and characterized by their transmission
dimensions. A min-cut transformation is introduced as a tool for
the spectrum analysis, which reduces any symmetric network to the
same canonical form. Parallel architectures with different
topologies can then be easily compared and
evaluated.
Z. G. Mou. The elements, structure, and taxonomy of
divide-and-conquer. In Theory and Practice of Higher-Order
Parallel Programming, Dagstuhl Seminar Report 169, pages 13–14,
February 1997.
Jeffrey Yepez, Guy P. Seeley, and George Mou. Lattice-gas
automata on parallel architectures. In Proceedings of
International Symposium on Parallel Computation, pages 522–525,
1993. Beijing, China. (Cached copy here.)
(For a review of lattice-gas automata, see
The Classical Lattice-Gas Method by Jeffrey Yepez, 1999,
DTIC/ADA455835.pdf.)
Mou, Z.G. (1999). Recursive individually distributed object. In:
Rolim, J., et al. Parallel and Distributed Processing. IPPS
1999. Lecture Notes in Computer Science, vol 1586. Springer,
Berlin, Heidelberg. https://doi.org/10.1007/BFb0097888. Paper
presented at IPPS/SPDP Workshops 12 April 1999.
TLDR:
This paper proposes an alternative model, the Individually
Distributed Object (IDO), which allows a single large object to
be distributed over a network, exposing the parallelism inside
each object's computation that the distributed-objects model
leaves out.
Abstract: Distributed Objects (DO) as defined by OMG’s
CORBA architecture provide a model for object-oriented
parallel distributed computing. The parallelism in this model
however is limited in that the distribution refers to the
mappings of different objects to different hosts, and not to
the distribution of any individual object. We propose in this
paper an alternative model called Individually Distributed
Object (IDO) which allows a single large object to be
distributed over a network, thus providing a high level
interface for the exploitation of parallelism inside the
computation of each object which was left out of the
distributed objects model. Moreover, we propose a set of
functionally orthogonal operations for the objects which allow
the objects to be recursively divided, combined, and
communicate over recursively divided address
space. Programming by divide-and-conquer is therefore
effectively supported under this framework. The Recursive
Individually Distributed Object (RIDO) has been adopted as the
primary parallel programming model in the Brokered Objects for
Ragged-network Gigaflops (BORG) project at the Applied Physics
Laboratory of Johns Hopkins University, and applied to
large-scale real-world problems.
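To make the model concrete, here is a minimal Haskell sketch of the
three orthogonal operations the abstract names; the function names
and the list-backed representation are my illustrative assumptions,
not the paper's API:

    -- Hypothetical sketch of RIDO's orthogonal operations on a
    -- list-backed "object": recursive division, combination, and
    -- communication over the recursively divided address space.
    divideLR :: [a] -> ([a], [a])
    divideLR xs = splitAt (length xs `div` 2) xs

    combineLR :: [a] -> [a] -> [a]
    combineLR = (++)

    -- Corresponding elements of the two halves exchange information
    -- through a binary operation.
    communicate :: (a -> a -> a) -> [a] -> [a] -> [a]
    communicate = zipWith

    -- Generic divide-and-conquer then falls out of the three operations.
    dc :: ([a] -> [a]) -> [a] -> [a]
    dc base xs
      | length xs <= 1 = base xs
      | otherwise      = combineLR (dc base l) (dc base r)
      where (l, r) = divideLR xs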
Google's MapReduce paper: "MapReduce: Simplified Data Processing on
Large Clusters," Jeffrey Dean and Sanjay Ghemawat, 2004.
Although this paper by Google is famous and considered the
pinnacle of parallel processing, one principle suggests it may
have flaws: any solution coming from a company like Google must be
unacceptable. "How do you know a solution is bad? If it's from
Google, that's enough." Reading between the lines, I notice that
the paper fails even to express the basic mathematical facts of
MapReduce, namely:
(1) it is a higher-order function: a function that takes functions
as input and returns a function as output, which may then be
applied to data;
(2) this particular higher-order function requires a unary
function (call it map);
(3) it requires an associative binary function (call it reduce);
(4) the returned function applies the unary map function to each
element of the input data vector, and the associative binary
reduce function to the vector of map outputs;
(5) obviously, yet perhaps not obviously enough for these authors
and their employer, an associative operation summarizing a vector
of data can be applied by recursive even-odd or left-right
division, combining after division, and therefore completed in
log N time. For example, in 3 time steps + can be applied to 8
elements as ((a+b)+(c+d))+((e+f)+(g+h)), rather than taking 7 time
steps to equivalently calculate (((((((a+b)+c)+d)+e)+f)+g)+h). Use
recursive parallel decomposition! (A minimal sketch follows this
list.)
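As a minimal Haskell sketch of facts (1) through (5) above (my
illustration; nothing here is from the paper):

    -- mapReduce is a higher-order function: it takes a unary function m
    -- and an associative binary function r, and reduces by recursive
    -- left-right division, giving a combining tree of depth log N.
    -- (The sketch assumes a non-empty input list.)
    mapReduce :: (a -> b) -> (b -> b -> b) -> [a] -> b
    mapReduce m _ [x] = m x
    mapReduce m r xs  = r (mapReduce m r left) (mapReduce m r right)
      where (left, right) = splitAt (length xs `div` 2) xs

    -- Example: mapReduce id (+) [1..8] computes
    -- ((1+2)+(3+4))+((5+6)+(7+8)): 3 parallel steps for 8 elements.
    example :: Int
    example = mapReduce id (+) [1 .. 8]   -- == 36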
To reiterate: although some parallelism is achieved, the design
fails to exploit the deep parallelism of a recursive
divide-and-conquer approach, so log N time is not achieved. On the
one hand, after a single M-ary division of the data, a number of
linear-time subtasks are scheduled. For the map operation, which
is non-combining, that is all the parallelism available once every
processor is busy doing essential map work: each piece must be
done independently, and no ordering or building up of results
saves operations. On the other hand, an associative reduce
operation applied to the whole can be done either head-tail (as
they do) or left-right (as would be better). Let's count. Either
could be done on 1 or on M subdivisions of the map outputs,
yielding O(N) vs. O(log N) time on 1 subdivision, or O(N/M)
vs. O(log(N/M)) time on M subdivisions, ignoring the final
combination/reduction. With N ~ 2^33 and M = 1, N vs. log N is
2^33 vs. 33; with N ~ 2^33 and M ~ 2^10, we still find N/M ~ 2^23,
while log(N/M) ~ 23. That is, Dean's splits indeed save about
three decimal orders of magnitude (a factor of M ~ 10^3), but
nothing like the further factor of 2^23/23 ~ 2^18 (about 18 binary
orders of magnitude, five to six decimal ones) saved by recursive
left-right division.
The point is that an associative reduce function applied to a
large dataset using left-right decomposition runs in log N time, a
fact that an absence of clear thinking will fail to notice,
however obvious it may be. This is made explicit in a Divacon
definition:
MR = PDC(d_lr,reduce,id,id,atom?,map);
Here an associative binary operation, reduce, is given as the
combination function; it combines outputs while moving back up the
computation tree after map has separately processed each input
element at the tree's leaf nodes. A left-right-subdivided
computation tree on 10^7 leaves is only about 23 levels deep,
reducing the task to O(log N) ~ 23 time steps, vastly less than
the vaunted O(10^7) time steps.
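For readers unfamiliar with Divacon, here is one possible Haskell
transliteration of that definition. The argument order follows the
PDC call above (divide, combine, pre-adjust, post-adjust, base
predicate, base function), but the types and the example are my
reading, not Divacon's actual implementation:

    -- One possible reading of PDC as a Haskell higher-order function.
    pdc :: ([a] -> ([a], [a]))   -- d_lr: left-right division
        -> (b -> b -> b)         -- reduce: combine child results
        -> ([a] -> [a])          -- pre-adjust (id in MR)
        -> (b -> b)              -- post-adjust (id in MR)
        -> ([a] -> Bool)         -- atom?: base-case predicate
        -> ([a] -> b)            -- map: applied at the leaves
        -> [a] -> b
    pdc divide comb pre post atom base = go
      where
        go xs
          | atom xs   = base xs
          | otherwise = post (comb (go l) (go r))
          where (l, r) = divide (pre xs)

    -- MR = PDC(d_lr, reduce, id, id, atom?, map): e.g., a sum of
    -- squares, where the leaf "map" squares and "reduce" is (+).
    mr :: Num a => [a] -> a
    mr = pdc dLR (+) id id ((<= 1) . length) (sum . map (^ 2))
      where dLR xs = splitAt (length xs `div` 2) xs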
Hence even a single division over processors comes nowhere close
to such improvements, despite admittedly achieving, say, a 1000x
improvement. Recursive parallel decomposition is the trick, and
the Google paper misses it.
Next, parallel communication opportunities are ignored
(cf. Optimal Mappings, here). Tree-based communication would allow
the subset of messages that stay within a sub-branch to be sent
and delivered simultaneously; instead, the basic Ethernet link
design that is Google's pride implies queued, sequential
communication rather than simultaneous, parallel communication, at
least within each set of processors, or set of sets of processors,
so linked. And when communications are not perfectly local, the
queueing/blocking bottlenecks at the top of the tree not only
remain but are multiplied combinatorially. Queuing and blocking
are therefore built in, and transmission characteristics are
vastly narrowed. See also the 1995 Min-Cut paper, above; indeed,
do a min-cut transform on Google's network and you will see the
relative limitations.
Zhijing G. Mou, Hai Liu, Paul Hudak, Compress-and-Conquer for
Optimal Multicore Computing, In Proceedings of the POPL 2010
Workshop on Declarative Aspects of Multicore Programming, DAMP
2010, Madrid, Spain, January 19, 2010.
ACM link here,
ResearchGate link here,
cached copy here.
Abstract:
The CC paradigm is implemented in Haskell as a modular,
higher-order function, compiled into low-level Haskell threads
that run on a multicore machine, and performance benchmarks
confirm the theoretical analysis.
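The paper's implementation is in Haskell; as a flavor of the
compress-and-conquer shape, here is a minimal sketch of my own
using prefix sums as the example (the block structure and names
are my assumptions, not the paper's code): each of p blocks is
compressed to a single value, the small compressed problem is
solved, and the solution is expanded back over the blocks.

    import Data.List (scanl')

    chunksOf :: Int -> [a] -> [[a]]
    chunksOf _ [] = []
    chunksOf n xs = take n xs : chunksOf n (drop n xs)

    -- Compress-and-conquer sketch for inclusive prefix sums on p cores.
    prefixSumsCC :: Int -> [Int] -> [Int]
    prefixSumsCC p xs = concat (zipWith expand offsets blocks)
      where
        blocks  = chunksOf (max 1 (length xs `div` p)) xs
        -- compress: one number per block (done in parallel in the paper)
        sums    = map sum blocks
        -- conquer: solve the small p-sized problem
        offsets = scanl' (+) 0 sums
        -- expand: local prefix sums shifted by each block's offset
        expand off b = tail (scanl' (+) off b)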
Patent WO2010075309A2:
Method and System for Multidimensional and Geographic Search.
International Pub. Date 1-July-2010. Cached copy here.
Z. George Mou, "Dimensional Shuffle-Transform for Non-Parametric
Image Classification" at the International Conference on Digital Image
Processing (ICDIP 2018) in Shanghai. SPIE Proceedings vol 10806.
Comments from George: The method of K-M Shuffle from
the Optimal Mappings paper is in fact the same as DST itself.
Furthermore, given K=3 and M=2, a point cloud of 3D points (LIDAR
or optical) can be mapped by DST to a point cloud of 2D keys. The
usual convolutional neural network (CNN), which is excellent at 2D
image classification but terrible at 3D image classification, can
then be used to classify the 3D-to-2D-DST-transformed point clouds
of 3D objects such as cars, trees, pedestrians, bicycles, signs,
poles, etc. Although the transformed images, viewed in 2D, do not
remotely appear to a human viewer as cars, trees, and so on, the
CNN is still great at learning to recognize those object classes
from the transformed-to-2D data.
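My attempt to capture the flavor of this in a Haskell sketch, on
the assumption (per George's comment) that DST is the index-digit
shuffle of the Optimal Mappings paper; the bit width, digit
ordering, and dealing order are my guesses for illustration only:

    import Data.Bits (testBit)

    -- Interleave the low n bits of each coordinate, most significant
    -- first: x_{n-1} y_{n-1} z_{n-1} ... x_0 y_0 z_0.
    interleave3 :: Int -> (Int, Int, Int) -> [Bool]
    interleave3 n (x, y, z) =
      concat [[testBit x i, testBit y i, testBit z i] | i <- [n - 1, n - 2 .. 0]]

    -- Deal the interleaved digit string alternately into M = 2 keys.
    deal2 :: [Bool] -> (Int, Int)
    deal2 bits = (pack evens, pack odds)
      where evens = [b | (i, b) <- zip [0 :: Int ..] bits, even i]
            odds  = [b | (i, b) <- zip [0 :: Int ..] bits, odd i]
            pack  = foldl (\acc b -> acc * 2 + fromEnum b) 0

    -- A 3D point with n-bit coordinates becomes a 2D key, i.e. a
    -- pixel position on a 2D grid that a CNN can consume.
    dst3to2 :: Int -> (Int, Int, Int) -> (Int, Int)
    dst3to2 n = deal2 . interleave3 n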
However the field is unaware of this amazing capability and may
remain so indefinitely. Indeed there are many beautiful
properties of DST, but it seems impossible to explain them to
anyone so that they truly understand them all.
TV note: asking DeepSeek AI about this in general terms after
George mentioned it to me, the AI found this paper, with this
title, given at this conference, which does appear to be real; but
it hallucinated the actual publication, which is not in those
proceedings. And George, for his part, says he has a presentation
of it but does not himself have a full paper.
dvkang.com. 2D DST demo website, at least.
Z. George Mou, Black-White Array: A New Data Structure for
Dynamic Data Sets. April 2020, arXiv preprint here. cached
copy here.
Orian Leitersdorf, Yahav Boneh, Gonen Gazit, Ronny Ronen, Shahar
Kvatinsky, FourierPIM: High-Throughput In-Memory Fast Fourier
Transform and Polynomial Multiplication. Viterbi Faculty of
Electrical and Computer Engineering, Technion – Israel Institute
of Technology, Haifa, 3200003, Israel. arXiv 2304.02336v1
[cs.AR], 5 Apr 2023.
Patent US20230351547A1, Methods and control systems that use
dimensional-transform-based three-dimensional searching and voxel
mapping. Published 02-November-2023. (Cached copy here.)
Notes on the patent here.
George wants to communicate, in comparing KD-Tree vs. DST, that
DST search time is related (proportional) to the size of the
searched region: if the region is small, the search is fast. In a
KD-tree, by contrast, even a small search region costs time
related to the (total) tree size, not the searched-region size. So
27M points in the search universe will make a KD-tree slow, but
not DST, which just zooms right in and ignores the rest. That's a
huge difference.
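A hedged sketch of the claimed access pattern (my illustration
only, not the patent's method): if points are stored in an ordered
map under their interleaved DST keys, a query over a small key
interval touches only that interval, independent of how many
points exist outside it. (A 3D box generally decomposes into a few
such key intervals.)

    import qualified Data.Map.Strict as M

    type Key   = Int             -- an interleaved DST key, as sketched above
    type Point = (Int, Int, Int)

    -- Cost: O(log n) to locate the interval boundaries, then
    -- proportional to the number of points returned -- not to the
    -- size of the whole 27M-point universe.
    queryInterval :: Key -> Key -> M.Map Key Point -> [Point]
    queryInterval lo hi m =
      M.elems (M.takeWhileAntitone (<= hi) (M.dropWhileAntitone (< lo) m))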