Fibonacci is the difference equation
\(F(n)\ \text{≜}\ F(n-1) + F(n-2)\), grounded on \(F(0)\ \text{≜}\ F(1)\ \text{≜}\ 1\), or, more conveniently for me (and equivalently), \(F(-2)\ \text{≜}\ 1\) and \(F(-1)\ \text{≜}\ 0\).

A notation using capital letters and apostrophes is often tidy: \(N'\ \text{≜}\ F(n-1)\) and \(N''\ \text{≜}\ F(n-2)\).

Naively by the definition, with each value requiring its immediate predecessors to calculate it, F(n) for large \(n\) seems to require O(n) operations. F(n) for \(n>2^{1000}\) would require huge multiples of the age of the universe, done serially. Fortunately, a general recurrence exists:
\(F(m+n) = F(m)*F(n-2) + F(m+1)*F(n-1) = (M'+M'')N'' + (2M'+M'')N' = M'N'' + M''N'' + 2M'N' + M''N'\)

Commutativity, \(F(m+n)=F(n+m)\), follows by inspection on each side of the = sign. (Grounding on F(0)=0, as in the link, instead of F(0)=1 as here, shifts the sequence by one position.) If \(n=m\) then this yields \(F(2n)\), and the recurrence can be applied repeatedly -- at any scale of course. Starting from \(n=1\), repeating the calculation \(a\) times yields \(F(2^a)\).

To double the length of a vector of Fibonacci values, the calculations require already-in-hand values. With zero-based indexing (F(0)=1), a vector of length \(n\) counts indexes 0..n-1, so F(n) itself is not already in hand. But F(n-1) and F(n-2), the elements of the difference equation, are already in hand, and we can encapsulate the pair of them as a trivial codomain for the Fibonacci domain:
\(pair(n)\ \text{≜}\ N\ \text{≜}\ \begin{bmatrix}F(n-1) \cr F(n-2)\end{bmatrix} = \begin{bmatrix}N' \cr N''\end{bmatrix} \)

The reverse mapping is \(get\ N = F(n-1)+F(n-2)\). Clearly \(F(n) = get\ pair(n)\).

Is there an operation within this codomain corresponding to + between domain indexes? From indexes \(m, n\), which combine with + to pick out \(F(m+n)\) in the Fibonacci domain, I would like to find an operation # which combines elements in the codomain. \(M\#N\) would then be the corresponding value (a pair) in the codomain:

\( M\#N = \begin{bmatrix}F(m+n-1) \cr F(m+n-2)\end{bmatrix} \)

Then, indeed, \(get\ M\#N = F(m+n)\), so that + applied to indexes in the Fibonacci domain corresponds to # applied to elements of the Fibonacci codomain. To work this out, substitute \(m-1\) for \(m\) in the recurrence to yield:

\( \begin{aligned} F((m -1) + \ \ n) & = F(m-1)*F(n-2) + F(m-1+1)*F(n-1) = F(m-1)*F(n-2) + [F(m-1)+F(m-2)]* F(n-1) \\ & = M'N'' + (M'+M'')N' = M'(N' + N'') + M''N' \end{aligned} \)

Next substitute \(m-2\) for \(m\) to yield:

\( \begin{aligned} F((m-2) + n) & = F(m-2)*F(n-2) + F(m-1)*F(n-1)\\ & = M''N''+M'N' \end{aligned} \)

Rearranging:

\( M\#N = \begin{bmatrix} M'(N'+N'') + M''N' \cr M'N' + M''N'' \end{bmatrix} \)

If \(m=n\) then

\( M\#M = \begin{bmatrix} (M')^2 + 2M'M''\cr (M')^2 + (M'')^2 \end{bmatrix} \)

Incidentally, \(N\#M=M\#N\) (commutativity) and \(M_1\#(M_2\#M_3) = (M_1\#M_2)\#M_3\) (associativity) follow from the definition.

In Divacon notation, \(F\ V\) applies \(F\) to a vector \(V\) of \(2^m\) indexes \(0..2^m-1\); the definition of \(F\) includes \(get\) to retrieve values from pairs as its final step. To describe the definition in words: the divide and combine are balanced left-right; there is no preadjust function, while the post-adjust function communicates between corresponding vector elements and carries out asymmetric local calculations. The left side calculates but does not add (+ in the domain, # in the codomain) the bit-shifted \(M\), retaining the value already developed, while the right side redundantly does the corresponding calculation for a bit shift of \(M\), and does add (using #) the previous \(M_R\) to the previously developed value \(N_R\). A Divacon computation graph for this calculation is shown and discussed below.
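Before turning to the graph, a small numeric check may help. The following Python sketch is mine, not Divacon notation: the names fib, pair, get, and sharp (standing in for #) are illustrative only, and the grounding F(0) = F(1) = 1 follows the text above. It verifies that \(get\ M\#N = F(m+n)\) and that repeated self-application of # doubles the index.

```python
# A small numeric check of pair, get, and the # operation (written sharp here),
# under the grounding above: F(0) = F(1) = 1, F(-1) = 0, F(-2) = 1.
# The function names are mine, not Divacon notation.

def fib(n):
    """Reference value of F(n) by the difference equation, for n >= -2."""
    a, b = 1, 0                      # a = F(-2), b = F(-1)
    if n == -2:
        return a
    for _ in range(n + 1):
        a, b = b, a + b              # slide the window one index upward
    return b

def pair(n):
    """Codomain element N = [F(n-1), F(n-2)]."""
    return (fib(n - 1), fib(n - 2))

def get(N):
    """Reverse mapping: get N = F(n-1) + F(n-2) = F(n)."""
    return N[0] + N[1]

def sharp(M, N):
    """M # N = [M'(N'+N'') + M''N',  M'N' + M''N''], i.e. pair(m+n)."""
    Mp, Mpp = M
    Np, Npp = N
    return (Mp * (Np + Npp) + Mpp * Np, Mp * Np + Mpp * Npp)

# get (M # N) = F(m+n):
assert get(sharp(pair(6), pair(7))) == fib(13)

# Repeated self-application doubles the index: starting from pair(1),
# a applications of # yield pair(2**a), hence F(2**a) in O(a) steps.
M = pair(1)                          # [F(0), F(-1)] = [1, 0]
for _ in range(5):
    M = sharp(M, M)
assert M == pair(32) and get(M) == fib(32)
```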
[Divacon computation graph for \(F\ [0..2^3-1]\)]
Here the left-right division is uninterrupted until the inputs are atomic vectors containing an index; then the base function \(f_b\) drops the (position-redundant) index number itself and inserts base values \(M_1=\begin{bmatrix}1 \cr 0 \end{bmatrix}\) and \(N_0=\begin{bmatrix}F(-1) \cr F(-2) \end{bmatrix} =\begin{bmatrix}0 \cr 1 \end{bmatrix}\) into a 2-tuple for each value. This 2-tuple, \(\lt M,N\gt \), serves to hold interim results: M corresponds to a single-bit index \(i\) (as in \(i=0x1\ll a,\ M=\begin{bmatrix}I'\cr I''\end{bmatrix}\)), and N corresponds to the entire value built up so far (the least-significant \(a\) bits of the index).

As combination brings levels together one level at a time, \(M\) is modified to hold the Fibonacci codomain value for an index one bit higher, namely, from that corresponding to \(F(2^a)\) to that corresponding to \(F(2^{a+1})\), which is \(M\#M\). \(N\) is unmodified on the low (left) side, since combination has effectively unmasked a zero in the next higher bit of its index; on the high side \(N\) becomes \(M\#N\), since a high bit is necessarily present, and always a 1, in the bit representation of the indexes on the right side. Thus \(X_R\) calculating \(M\#N\) in the codomain is equivalent to adding indexes \(2^a+n\) in the Fibonacci index domain.

When \(c_{lr}\) combination is no longer possible, no higher bits remain to be unmasked in the original indexes, and the values in the N codomain element of each tuple are final. In the end, \(!\ get\) applies to each codomain tuple, ignoring M and adding the components of N, to yield the final values \(F(i)\), in order. Division and combination, being balanced, clearly map from a \(2^m\) length vector to atomic vectors in \(m\) steps, and the operations \(f_b, X,\) and \(\#\ !\ corr\) are unit-time operations, so this algorithm requires \(O(m)\) time in parallel, logarithmic in the vector length \(2^m\).

Previous verbose discussion on Fibonacci is here, and a graphical display of a Fibonacci sequence is here.
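As a sanity check of the description above, here is a serial Python sketch of the same computation graph: it installs the base tuples \(\lt M_1,N_0\gt\), runs one combination level per bit of the index, and applies get at the end. The names and the loop structure are mine; the Divacon version performs each level's element-wise work in parallel, whereas this simulation is an ordinary sequential pass per level. The sharp (#) and get helpers are repeated so the block runs on its own.

```python
# Serial simulation of the computation graph: <M, N> tuples, one combination
# level per bit, final ! get. In Divacon the per-level work is parallel.

def sharp(M, N):
    """pair(m) # pair(n) = pair(m+n)."""
    (Mp, Mpp), (Np, Npp) = M, N
    return (Mp * (Np + Npp) + Mpp * Np, Mp * Np + Mpp * Npp)

def get(N):
    """get [F(n-1), F(n-2)] = F(n)."""
    return N[0] + N[1]

def fib_vector(m):
    """Return [F(0), ..., F(2**m - 1)] with the grounding F(0) = F(1) = 1."""
    n = 2 ** m
    # base function f_b: drop the (position-redundant) index, install <M1, N0>
    tuples = [((1, 0), (0, 1))] * n
    width = 1
    while width < n:                              # one combination level per bit
        for block in range(0, n, 2 * width):
            for j in range(width):
                ML, NL = tuples[block + j]             # low (left) element
                MR, NR = tuples[block + width + j]     # high (right) element
                # left: keep N, shift M up one bit (M # M)
                tuples[block + j] = (sharp(ML, ML), NL)
                # right: add the unmasked high bit to N (M # N), and shift M
                tuples[block + width + j] = (sharp(MR, MR), sharp(MR, NR))
        width *= 2
    return [get(N) for (_M, N) in tuples]         # ! get: ignore M, sum N's parts

print(fib_vector(3))   # [1, 1, 2, 3, 5, 8, 13, 21]
```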
Notes

A matrix multiplication version of Fibonacci index addition is available but somewhat slower. There, the codomain comprises 2x2 matrices built up from \(M_1= \begin{bmatrix} 1\ \ 1 \cr 1\ \ 0 \end{bmatrix}\); the operation # is the normal matrix multiply (which sums the exponents of two recursively-multiplied matrices), and \(get\ M\) extracts the top row's last element from M. (A small sketch appears at the end of these notes.)

You may observe that \(M\#M\) is calculated redundantly throughout the computation graph, and wonder if there is an optimization eliminating that redundancy. However, these calculations are done in parallel, so they add no time cost to the process as a whole, and their results are required independently within each subtree, so calculating once and communicating the result everywhere would add much delay compared with this approach, which calculates everywhere redundantly and needs no communication. Moving data across buses and between cache levels can cost 10x, 100x, 1000x the time delay of doing calculations within a CPU, so communication needs to be thought about carefully and minimized, for example using the Optimal Mapping approach discussed in Mou & Wang 1993, and in my notes on that paper.

You may also note that each \(c_{lr}\) output is the same at each level, so that correspondent communication could be used to populate an empty high side, or in other words, the vector of values may be doubled in size at each level. But that is a different idea: instead of \(F\ V\), your new idea would be a new function \(F\ a\), with \(a\) the number of doublings, as well as other ramifications (to drop \(d_{lr}\) from the paradigm, to store and to decrement \(a\) at each level, to perhaps add processors and memory at each doubling).
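Here is the promised sketch of the matrix version, again in Python and with names of my own choosing. It assumes the standard seed matrix \(M_1\) above and ordinary 2x2 multiplication as #; note that the top-row last element of \(M_1^n\) is the n-th Fibonacci number of the conventional F(0)=0 grounding, one position below the F(0)=1 grounding used on this page.

```python
# Matrix version: the codomain element for index n is M1**n, # is matrix
# multiply (exponents, i.e. indexes, add), and get takes the top-row last
# element. Indexing follows the conventional F(0) = 0 grounding here.

def matmul2(A, B):
    """Ordinary 2x2 matrix multiply; this is # in the matrix codomain."""
    return ((A[0][0]*B[0][0] + A[0][1]*B[1][0], A[0][0]*B[0][1] + A[0][1]*B[1][1]),
            (A[1][0]*B[0][0] + A[1][1]*B[1][0], A[1][0]*B[0][1] + A[1][1]*B[1][1]))

def get_matrix(M):
    """get M: extract the top row's last element."""
    return M[0][1]

M1 = ((1, 1), (1, 0))

# Doubling M1 -> M1^2 -> M1^4 -> ... reaches index 2^a in a multiplications,
# but each 2x2 multiply costs 8 scalar multiplications where the pair form
# of # needs only 4 distinct products -- hence "somewhat slower".
M = M1
for _ in range(5):
    M = matmul2(M, M)          # the exponent (the Fibonacci index) doubles
print(get_matrix(M))           # 2178309, the 32nd value in the 0-grounded numbering
```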
Bibliography

Z. George Mou and Xiaojing Wang, "Optimal Mappings of m Dimensional FFT Communication to k Dimensional Mesh for Arbitrary m and k," pp. 104-119, in PARLE '93: Parallel Architectures and Languages Europe, 5th International PARLE Conference, Munich, Germany, June 1993, Proceedings. Lecture Notes in Computer Science #694, Springer-Verlag.