Ordered by date of original research or of publication.
English version translation © 2001 M.E. Sharpe, Inc., from the
Chinese original, in Contemporary Chinese Thought, vol. 32,
no. 4, Summer 2001, pp. 17–36. Cached copy here.
"The author of this essay, Yu Luoke, was a young worker and
thinker in Beijing. Because of this writing, he was executed by
the government in 1970."
Cached copy here
In this standard nose-on-floor view, you know the current state of the
machine, supply an input, and you get an output and a transition
to a new state.
For example, the same input may produce different outputs if it is
provided in two different states.
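As a minimal sketch of this nose-on-floor view, here is a hypothetical two-state transducer (my own illustrative example, not from any paper): the state tracks the parity of 1s seen so far, and the output is the input bit XORed with the state, so the same input symbol yields different outputs in different states.

```python
# Nose-on-floor view of a finite-state transducer: given the current
# state and one input symbol, look up the output symbol and successor
# state.  Hypothetical example: state = parity of 1s seen so far,
# output = input bit XOR state.
DELTA = {
    # (state, input) -> (output, next_state)
    (0, '0'): ('0', 0),
    (0, '1'): ('1', 1),
    (1, '0'): ('1', 1),
    (1, '1'): ('0', 0),
}

def step(state, symbol):
    return DELTA[(state, symbol)]

# Same input, different state, different output:
# step(0, '0') gives ('0', 0) while step(1, '0') gives ('1', 1).
```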
This paper is about doing this in parallel.
Suppose you have an FSA with \(10^6\) or \(10^9\) input symbols. The intuition is that you must feed it one symbol at a time: at each time step it receives one symbol, in whatever state it is in, looks at the table, and produces one output and a successor state.
But is it possible to do this in less time than one step per symbol? This paper claims you can do it in logarithmic time: since \(10^{6}\approx 2^{20}\) and \(10^{9}\approx 2^{30}\), in 20 or 30 steps you can get all the million or billion (if 1-1) output symbols.
However, you could rotate the perspective (Tom: or invert it, or lift your nose off the floor), and consider that an input symbol is itself the function: it applies to one state and returns a pair, the output and the successor state.
If this global-perspective FSA function has the right nature
... then: you could compose these functions. As it turns out,
function composition scales! Can you compose \(f_{1-2} = f_2 \cdot f_1\)
in unit time?
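Under this view a symbol's function can be represented as a table from predecessor state to (output, successor state), and composing two such tables costs O(number of states), which is constant with respect to input length, i.e. "unit time" per composition. A minimal sketch, with illustrative tables of my own invention:

```python
# Each input symbol, nose off the floor, is a function from a
# predecessor state to (output string, successor state).  Composing
# two such functions costs O(|states|), independent of input length.
def compose(f1, f2):
    """Return the composition 'first f1, then f2': run f1 from each
    state, then f2 from f1's successor state, concatenating outputs."""
    g = {}
    for state, (out1, mid) in f1.items():
        if mid in f2:                  # missing cell = symbol rejected there
            out2, succ = f2[mid]
            g[state] = (out1 + out2, succ)
    return g

# Hypothetical symbol-functions over states {0, 1}:
f_a = {0: ('x', 1), 1: ('y', 0)}
f_b = {0: ('p', 0), 1: ('q', 1)}
# compose(f_a, f_b) maps state 0 to ('xq', 1) and state 1 to ('yp', 0).
```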
Apply this in a Divacon Left-Right Postadjust computation graph and, boom, done: you have a doubled and doubled and doubled (as deep as you want to go) composition of compositions of compositions of InputSymbolic FST *functions*. Finally, at the top of the postadjust tree, the last function maps from the first input state to the final output state, consuming the entire input sequence and outputting the whole output symbol sequence.
Did I slip one past you? Actually I didn't. That's the whole thing.
But if you think back on it and now you don't think you got it, please
read it twice more, make informal drawings of it three or four times,
then you may want to learn about Divacon, post-adjustment and the Left-Right
divide/combine functions, and visualize how the computation graph
might look.
Finally, figure out what these objects are, the functions
that are composed. Are they lookup tables? In the nose-to-floor view,
yes: one row for each state, one column per input symbol; a cell is
empty if that input symbol is not accepted in that state, and when not
empty the cell contains the successor state and output symbol. It is
the nose-to-floor view because you can only see what happens to one
symbol at a time. The inverted view is some slightly different
something or other. What is it? Each input symbol has its own InSym
table, which comprises the function that we consider the symbol to be.
Really it's just a single row: the whole row is for that one input
symbol, not for each state, and there is one column for each
current (predecessor) state, so that the cells of the row combine the
output symbol and the successor state which result from that input
being accepted in that predecessor state.
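The rotation between the two views can be sketched directly; the table contents here are hypothetical, chosen only to show the shape of each view:

```python
# Nose-to-floor view: one row per state, one column per input symbol.
# (A missing entry is an empty cell: symbol not accepted in that state.)
BY_STATE = {
    0: {'a': ('x', 1), 'b': ('y', 0)},
    1: {'a': ('z', 0)},                 # 'b' not accepted in state 1
}

def by_symbol(by_state):
    """Rotate to the inverted view: one table (a single row) per input
    symbol, with one cell per predecessor state."""
    tables = {}
    for state, row in by_state.items():
        for sym, cell in row.items():
            tables.setdefault(sym, {})[state] = cell
    return tables

# by_symbol(BY_STATE)['a'] is {0: ('x', 1), 1: ('z', 0)}: the whole
# row belongs to symbol 'a', indexed by predecessor state.
```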
What's the difference? Just rotate the table, rows for columns and
columns for rows, to get this inverted view. But now you get the joy
of composition: now the computation takes functions to
composed/combined functions, not functions to single steps along the
path, to no more than just the next single state. Here the operation
takes two whole functions and makes a whole new function, the
composition of the input functions. Instead of taking one symbol to
one state and one output symbol, it takes two input symbols to one
output state along with an output symbol sequence.
Each (predecessor) cell, which simply has the result stored in it,
namely the pair or tuple [outSym, succState] = InSym(pred), can be
composed with any successor cell: its row given by the next symbol in
the FSA's input string, and its column, which encodes the from-state,
selected according to succState above. Just concatenate the outputs
and sequence the states accordingly. The output of the composition
goes in a row for the input symbol PAIR, in the cell corresponding to
the from-state; into that cell go an output symbol *sequence* and the
successor state of the successor cell. Of course we might have to do
that for all the input symbols and matching successor cells, but only
if we split the input symbol sequence into onesies, then combine them
into the composed functions of length-2 inputs... .
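Putting the pieces together, here is a sequential simulation of the doubling tree (my own sketch, not Divacon notation): lift each input symbol to its state-to-(output, successor) table, then compose adjacent tables pairwise, doubling the span at each level. Each level's compositions are independent, hence parallelizable, and there are log2(N) levels. For simplicity this sketch assumes a power-of-two number of input symbols.

```python
# Lift each symbol to a function-table, then compose adjacent tables
# pairwise, level by level: log2(N) levels for N symbols, and every
# composition within a level could run in parallel.
def compose(f1, f2):
    """First f1, then f2: concatenate outputs, chain successor states."""
    return {s: (o1 + f2[m][0], f2[m][1])
            for s, (o1, m) in f1.items() if m in f2}

def run(tables, symbols, start):
    fs = [tables[sym] for sym in symbols]        # leaves of the tree
    while len(fs) > 1:                           # one tree level per pass
        fs = [compose(fs[i], fs[i + 1]) for i in range(0, len(fs), 2)]
    return fs[0][start]        # (whole output sequence, final state)

# Hypothetical per-symbol tables over states {0, 1}:
TABLES = {'a': {0: ('x', 1), 1: ('z', 0)},
          'b': {0: ('y', 0)}}
# run(TABLES, 'aaaa', 0) gives ('xzxz', 0).
```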
You can take it from here.
(For review on lattice gas automata see
The Classical Lattice-Gas Method by Jeffrey Yepez 1999,
DTIC/ADA455835.pdf).
Abstract: Distributed Objects (DO) as defined by OMG’s
CORBA architecture provide a model for object-oriented
parallel distributed computing. The parallelism in this model
however is limited in that the distribution refers to the
mappings of different objects to different hosts, and not to
the distribution of any individual object. We propose in this
paper an alternative model called Individually Distributed
Object (IDO) which allows a single large object to be
distributed over a network, thus providing a high level
interface for the exploitation of parallelism inside the
computation of each object which was left out of the
distributed objects model. Moreover, we propose a set of
functionally orthogonal operations for the objects which allow
the objects to be recursively divided, combined, and
communicate over recursively divided address
space. Programming by divide-and-conquer is therefore
effectively supported under this framework. The Recursive
Individually Distributed Object (RIDO) has been adopted as the
primary parallel programming model in the Brokered Objects for
Ragged-network Gigaflops (BORG) project at the Applied Physics
Laboratory of Johns Hopkins University, and applied to
large-scale real-world problems.
Although this paper by Google is famous and considered the
pinnacle of parallel processing, the principle that any solution
coming from a company like Google must be unacceptable suggests
it may have flaws. ("How do you know a solution is bad? If it's
from Google, that's enough.") Reading between the lines, I notice
that the paper fails even to express the basic mathematical facts
of MapReduce, namely: (1) it is a higher-order function (a
function that takes functions as input and returns a function as
output, which may then be applied to data); (2) this particular
higher-order function specifically requires a unary function
(call it map) and (3) an associative binary function (call it
reduce); (4) the returned function applies the unary map
function to each element of the input data vector, and the
associative binary reduce function to the vector of map
outputs; and (5), obviously yet perhaps not obviously enough for
these authors and their employer, an associative operation
summarizing a vector of data can be applied using recursive
even-odd division, or left-right combination after division, and
therefore completed in log N time. For example, in 3 time steps +
can be applied to 8 elements as ((a+b)+(c+d))+((e+f)+(g+h)),
rather than using 7 time steps to equivalently calculate
(((((((a+b)+c)+d)+e)+f)+g)+h). Use recursive parallel
decomposition!
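The 3-steps-versus-7 point can be made concrete with a small sketch (names and structure are mine, for illustration): pair adjacent elements and halve the vector on each pass, counting each pass as one parallel time step.

```python
# Associative reduce in log2(N) parallel steps: each pass combines
# adjacent pairs, so the vector halves each time.  For 8 elements
# that is 3 passes, computing ((a+b)+(c+d))+((e+f)+(g+h)), versus 7
# sequential steps for (((((((a+b)+c)+d)+e)+f)+g)+h).
# Assumes a power-of-two length for simplicity.
def tree_reduce(op, xs):
    steps = 0
    while len(xs) > 1:
        xs = [op(xs[i], xs[i + 1]) for i in range(0, len(xs), 2)]
        steps += 1            # one parallel time step per pass
    return xs[0], steps

# tree_reduce(lambda a, b: a + b, [1,2,3,4,5,6,7,8]) returns (36, 3):
# the same sum as 7 sequential additions, in 3 parallel steps.
```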
To reiterate, although some parallelism is achieved it fails to
take advantage of the deep parallelism of a recursive
divide-and-conquer approach, so that log N time is not achieved.
On the one hand, after a single M-ary division of the data, a
number of linear-time subtasks are scheduled. In the case of the map
operation, which is non-combinatory, if all the available
processors are busy doing essential map work, that's all the
parallelism available: all must be done independently, and no
ordering or building up of results saves operations. On the
other hand, an associative reduce operation applied to the whole
can be done in either head-tail (as they do) or left-right (as
would be better) fashion. Let's count. Each of those could be
done on 1 or M subdivisions of the map outputs, yielding O(N) vs
O(log N) time on 1 subdivision, or O(N/M) vs O(log(N/M))
time on M subdivisions, ignoring the final
combination/reduction. With N ~ 2^33 and M = 1, N vs log N is
2^33 vs 33; whereas, with N ~ 2^33 and M ~ 2^10, we still find
N/M ~ 2^23, while log(N/M) ~ 23. That is, Dean's splits indeed
save 3 decimal orders of magnitude, but nothing like the
log(2^23/23) ~ 23 - 4.5 ~ 18.5 binary orders of magnitude saved
by recursive left-right division.
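The arithmetic above (all logs base 2) can be checked mechanically:

```python
# Checking the counts: N ~ 2^33 items, M ~ 2^10 splits.  Each split
# still performs N/M ~ 2^23 sequential reduce steps, where a
# left-right tree would need only log2(N/M) ~ 23.
from math import log2

N, M = 2 ** 33, 2 ** 10
assert N // M == 2 ** 23          # work remaining per split
assert log2(N // M) == 23.0       # tree depth for that split
# Advantage of the tree over the sequential split, in binary orders
# of magnitude: log2((N/M) / log2(N/M)) = 23 - log2(23) ~ 18.5
saving = log2((N / M) / log2(N / M))
```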
The point is that an associative reduce function applied to a
large dataset using left-right decomposition enables log N time;
an absence of clear thinking will fail to notice this, however
obvious it may be. This is made explicit in a Divacon definition:
Here an associative binary operation, reduce, is given as the
combination function, which combines outputs while moving back up
the computation tree after map has separately processed each
input element at the computation tree's leaf nodes. A
left-right-subdivided computation tree on 10^7 leaves is only 23
levels deep, reducing the task to O(log N) = 23 time steps,
vastly fewer than the vaunted O(10^7) time steps.
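The Divacon definition just described can be sketched as follows (in Python rather than Divacon notation, and with names of my own choosing): divide left-right down to single leaves, apply map there, then combine with the associative reduce on the way back up, giving a tree of depth ceil(log2(N)).

```python
# Divide-and-conquer skeleton: left-right division to single-element
# leaves, map applied at each leaf, associative reduce as the combine
# function on the way back up.  The two recursive calls at each node
# are independent, so tree depth = parallel time = ceil(log2(N)).
def divacon(map_f, reduce_f, xs):
    if len(xs) == 1:
        return map_f(xs[0])                       # leaf: map one element
    mid = len(xs) // 2
    left = divacon(map_f, reduce_f, xs[:mid])     # these two calls could
    right = divacon(map_f, reduce_f, xs[mid:])    # run in parallel
    return reduce_f(left, right)                  # combine going back up

# divacon(square, add, [1, 2, 3, 4]) computes (1+4) + (9+16) = 30.
```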
Hence even dividing once over processors fails to achieve
anywhere close to such improvements, despite admittedly
achieving, say, 1000x improvements. Recursive parallel
decomposition is the trick, which the Google paper misses.
Next, parallel communication opportunities are ignored (cf.
Optimal mappings, here). Whereas tree-based communication would
let that subset of communications which can be sent and delivered
within sub-branches proceed simultaneously, the basic Ethernet
link design used in Google's pride here implies queued,
sequential communication rather than simultaneous, parallel
communication, at least within the set of processors, or sets of
sets of processors, so linked. And when communications are not
perfectly local, the queueing/blocking bottlenecks at the top of
the tree not only remain but are multiplied combinatorially.
Therefore queueing and blocking are implicated, and transmission
characteristics are vastly narrowed. See also the 1995 Min-Cut
paper, above; indeed, do a Min-Cut transform on Google's network
and you will see the relative limitations.
My notes here.
However the field is unaware of this amazing capability and may
remain so indefinitely. Indeed there are many beautiful
properties of DST, but it seems impossible to explain them to
anyone so that they truly understand them all.
TV note: asking DeepSeek AI about this in general terms after
George mentioned it to me, the AI found this paper with this
title given at this conference, which does appear to be real; but
it hallucinated the actual publication, which is not actually in
those proceedings. And George, for his part, says he has a
presentation of it, but does not himself have a full paper.
Notes on the DST patent here.
George wants to communicate the idea, in comparing KD Tree vs DST, that
DST search time is related (proportional) to the size of the searched
region: if small, then small. Whereas in a KD tree, search over a
small region is related to the (total) tree size, not the searched
region's size; so 27M points in the search universe will make a
KD tree slow, but not DST, which just zooms right in and ignores the
rest. That's a huge difference.