"... to illustrate him is to cultivate real Science, and to make his Discoveries easy and familiar will be no small improvement in Mathematicks and Philosophy."
(John Colson, 1736, in the dedication to his translation from Latin to English of Isaac Newton's 'Method of Fluxions and Infinite Series' (original, 1671, unpublished).)
Now, the best value for any weight in a neural network is the value at which the errors only get worse no matter which direction or distance you move away from it. This value is where the error is at its minimum, and, being at the bottom of the error function, it is also where the slope of the error function is zero. For each weight, then, we will find the slope of the error function, and find where that slope has the value zero.
So the whole job is to find a zero-crossing. I learned "Newton's method" for this in high school. Did you? It's worth review.
Given a point on your function, find its slope, then triangulate down from that point, following the line with that slope to its zero-crossing point; that's your new point. Repeat. Each repetition yields a better approximation.

Hold on: an approximation of what? In our task of weight re-estimation in neural networks, the zero we seek is not the zero of the error function \(\large e\) (or even of \(\large E = \sum e\)), but the zero of its derivative, \(\large e'\), at which the errors do not get any better by modifying the weights \(w_{i,j}\). Zero slope occurs at the bottom of the error curve -- exactly the minimum we seek.
We will find our line using a point+slope approach to drawing lines: a point on the line, \((w,e'(w))\), plus its slope, \(e''(w)\), let you draw from that point, on that slope, up and down, as far as you like, extrapolating to every point of the line. In particular, we follow our line from the current estimate \((w, e'(w))\) of our function, given weight \(w\), to where it crosses zero; to do so we will need the slope \(e''\) of the derivative \(e'\) at this location. Did I make sense?
Let's draw it out, and it will become obvious.
We are given a curve, \(\large e'(w)\), in the \(\large w, e'\) coordinate system, for some weight \(\large w=w_{i,j}\). The error function is \(\large e = \frac{1}{2}(\hat z - z)^2\), and the weight \(\large w_{i,j}\) sits deep inside the layers of \(\large \hat z\), in some compositional function \(\large s_j\) (or \(\large W_l\)) within the neural network. But it's in there, somewhere, and so \(\large e\) is also a function of \(\large w_{i,j}\), and it's that relationship, between \(\large e\) and \(\large w_{i,j}\), that we are interested in -- its slope, actually: \(\large e' = \frac{\partial e}{\partial w_{i,j}} \). In that relationship we seek our target: the zero-crossing point of the curve, \((w_t,e'(w_t))=(w_t,0)\), where the error can no longer improve because \(\large e'=0\).
However, we can't count on our first guess being close to \(\large w_t\); instead, we have some typically bad current estimate \(\large w\), and this current estimate has coordinates \(\large (w, e')\), where \(\large |e'| \gt 0\) means the system has room to improve. We will find an improved estimate \(\large \hat w\), closer to \(\large w_t\), by drawing a line tangent to the curve at our known point \(\large (w, e')\).
We now draw a tangent to our curve at our current estimate's coordinates, \(\large (w, e')\), and follow the tangent line down to its zero-intercept at \(\large (\hat w, 0)\). The tangent line kisses the curve at the point \(\large (w, e')\), and is a linear approximation of the curve at the kissing point. This line defines a triangle with the horizontal axis, where \(\large e'=0\), and the line's zero-crossing \(\large \hat w\) will be our improved estimate of \(\large w_t\). If we do this a few times we will get (increasingly rapidly) increasingly close to \(\large w_t\). So that's our picture.
Now the slope of the curve at the kissing point is the same as the slope of the tangent line, and we can calculate this value, \(\large e''\), either way. Looking back at the triangle, we can relate all these numbers using first principles.
\( \large e'' = \frac{ e' - 0 } { w - \hat w } \) (because slope = rise/run). So
\( \large (w - \hat w)\, e'' = e' \quad\Longrightarrow\quad w - \hat w = \frac{e'}{e''} \quad\Longrightarrow\quad \hat w = w - \frac{e'}{e''} \)
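To make the rule concrete, here is a minimal 1-D sketch (the toy function and all names are mine, not from the text): we minimize \(e(w) = \frac{1}{4}(w^2-2)^2\) by driving its derivative \(e'(w) = w(w^2-2)\) to zero with the update above.

```python
# Minimal sketch of Successive Linear Approximation (Newton-Raphson) in 1-D.
# Hypothetical toy problem: minimize e(w) = (w^2 - 2)^2 / 4, whose minimum is
# at w = sqrt(2), i.e., where e'(w) = 0.

def e_prime(w):        # e'(w)  = w(w^2 - 2)
    return w * (w * w - 2.0)

def e_dbl_prime(w):    # e''(w) = 3w^2 - 2
    return 3.0 * w * w - 2.0

w = 2.0                # a deliberately bad first estimate
for step in range(5):
    w = w - e_prime(w) / e_dbl_prime(w)   # the SLA update: w_hat = w - e'/e''
    print(step, w)     # approaches 1.41421356... (sqrt 2)
```

Each pass roughly doubles the number of correct digits, which is the quadratic convergence Newton's method is known for on well-behaved functions.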
In charity to Newton (and less so to Raphson) this is called the Newton-Raphson update rule, but actually it's simply Successive Linear Approximation. I'll call it SLA.
Commonly, a few rounds of SLA are enough to produce several decimals of precision in estimating parameters of well-behaved functions. So if we can use it, it's worth a try. That's why.
Let me elaborate, for a deeper intuition. The key comparison between gradient descent and SLA comes down to local vs. global optima: maybe your local direction toward a local optimum isn't the direction toward the global optimum, and therefore you might hope to find a different way, along the way, by taking smaller steps. My point is: consider how irrational that hope is. Not impossible, just very small.
For gradient descent to do better than SLA, your problem's solution space would need a local optimum that's somehow local to your start point yet with the strange property that on your track toward it you fall off the side of a cliff in a different direction toward a global optimum. If that were the case, then taking only small steps would help you find a way to the better path, indeed, because you don't skip past the intersecting dropoff.
But wouldn't that global optimum be your actual local optimum, from the perspective of your starting point? It's certainly the endpoint of the steepest-slope path from the start point. Yes, this could imaginably happen if you are blindly hugging the sharp edge of a descending ridge, oblivious to the cliff right next to you, but at some point crossing over the edge. It requires a very special start point followed by a very special local pathway across a very specially shaped local surface to make this happen. By this reasoning, you will not lose many forking high-value paths by forgoing the small incremental steps for Isaac Newton's pretty-good estimate of the right distance in the right direction. Instead, you'll get there a lot more quickly and precisely. You choose.
Let me say it again.
If the direction toward local improvement is the direction you keep incrementally travelling, re-evaluating, and travelling again, then going only a small step in that direction is hardly likely to suddenly open a new, also locally-optimal, side-track path that leads to the global optimum. You have to imagine a sort of pre-graded fork in the road that happens to land in your path and slope you onto a branching path going a different and better way: away from your local path toward your supposedly local optimum, and coincidentally out toward the distant global optimum instead. I mean, how many of those do you expect in a random landscape, on a random path down its hillsides and valleys? Few to zero, is my guess.
Think about mountain paths, and how they would have to be shaped to produce this. You'd need to find yourself, dropped randomly, high up on a downsloping surface or ridge which has a cliff defining a falling-off edge alongside it, with a lake inside the edge of the ridge which your locally-steepest-descent path points toward, while the cliff edge eats into the ridge to intersect your path toward the lake. The lake is the local optimum/minimum, which the upper surface of the ridge largely slopes toward, while the cliff offers a steeper path down toward a more global optimum. Further, your starting point atop the ridge must be close enough to the edge that, despite steepest descent from there, by taking smaller steps you would somehow fall off the cliff and find the path to the global optimum. Yes, it could occur, but as for the number of these in actual mountains, I'm not sure I know of any. There are plenty of lakes in the mountains, and local optima for unknown functions, but I have never walked down a bowl toward a lake and found myself in danger of falling off a cliff. Cliffs are rarely inside mountainous bowl-shaped features; rather, the cliffs are above, and they slope down toward the lakes from every point except up over the sharpest edge, where the slope sends you the other way, to a different local optimum, unless the valleys merge below. That's what I'd expect. A river valley? Fine: we roll down to the bottom on the steepest slope, then down the valley. A waterfall at the bottom of some valley? Our path gets suddenly steeper; that's fine too. SLA will solve those scenarios.

But the one with the cliff edge cutting into a ridge above the lake, SLA won't solve; still, I'm not impressed by the likelihood of encountering that in your normal neural-network workday. First, such configurations would have to exist in the space you're searching, which I suggest is rare. And second, the random start points that would put you at just the right point on just such a ridge, next to just such a cliff, following just such a path toward the lake rather than the cliff, yet by luckily small steps finding and falling off the cliff rather than skipping past it -- those seem proportionately even fewer.
By this intuition, I justify SLA over gradient descent.
Now, how do we do it?
\(\large e'_{i,j} = \frac{\partial e}{\partial w_{i,j}} = \frac{\partial s_j}{\partial w_{i,j}} \frac{\partial o_j}{\partial s_j} \frac{\partial e}{\partial o_j} = C_j B_j A_j \), where \(A_j\) (26.1), \(B_j\) (24), and \(C_j\) (21) were previously worked out. Now consider \(\large e''\), defined with respect to a particular weight \(\large w_{i,j}\). We will apply the product rule (PR), \( d(u*v) = v*du + u*dv\), twice:
\(\begin{align*} \large e''_{i,j} &= \frac{\partial ^2e}{(\partial w_{i,j})^2} \\ &= \frac{\partial}{\partial w_{i,j}} \frac{\partial e}{\partial w_{i,j}} \\ &= \frac{\partial}{\partial w_{i,j}} (A_j B_j C_j) \\ &= (A_j B_j) \frac{\partial C_j}{\partial w_{i,j}} + C_j \frac{\partial (A_j B_j)}{\partial w_{i,j}} \ \ \ \ \ (PR)\\ \end{align*}\)

Simplify with \(C_j\):
\( C_j = \large \frac{\partial s_j}{\partial w_{i,j}} = o_i \ \ \ \ \ (cf\ 21) \), so \( C_j' = \large \frac{\partial C_j}{\partial w_{i,j}} = 0 \ \ \ \ \ \ \ (26.2) \) (because \(o_i\) is independent of \(w_{i,j}\)).
\(\begin{align*} \large e''_{i,j} &= (A_j B_j)* 0 + C_j \frac{\partial (A_j B_j)}{\partial w_{i,j}} \ \ \ \ \ \ (cf\ 26.2,21) \\ &= o_i \frac{\partial (A_j B_j)}{\partial w_{i,j}} \\ &= o_i [ \frac{\partial B_j}{\partial w_{i,j}} A_j + B_j \frac{\partial A_j}{\partial w_{i,j}} ] \ \ \ \ \ \ (PR)\ \ \ \ \ (27) \\ \end{align*}\)

To develop (27) further, we'll need to find these derivatives, using the chain rule (CR) as well as (PR).
\(\large B_j = \frac{\partial o_j}{\partial s_j} = o_j(1-o_j) \ \ \ \ \ \ (cf\ 24)\)

By the chain rule, since \(o_j = θ(s_j)\) makes \(B_j\) a function of \(s_j\), and \(s_j\) is a function of \(w_{i,j}\):

\(\large \frac{\partial B_j}{\partial w_{i,j}} = \frac{\partial B_j}{\partial s_j} \frac{\partial s_j}{\partial w_{i,j}} = o_i\frac{\partial B_j}{\partial s_j}\ \ \ \ \ \ (CR)\)
\(\begin{align*} \frac{\partial B_j}{\partial s_j} &= \frac{\partial }{\partial s_j} θ(s_j)(1-θ(s_j)) = \frac{\partial }{\partial s_j} o_j(1-o_j) \\ &= (1-o_j)\frac{\partial o_j}{\partial s_j} + o_j \frac{\partial}{\partial s_j}(1-o_j)\ \ \ \ \ \ (PR) \\ &= (1-o_j) B_j + o_j [0 - \frac{\partial o_j}{\partial s_j} ] \\ &= (1-o_j) o_j (1-o_j) + o_j [-B_j] \\ &= (1-o_j) o_j(1-o_j) - o_j [o_j(1-o_j)] \\ &= o_j(1-o_j) (1-o_j-o_j) \\ &= o_j(1-o_j) (1-2o_j) \\ \end{align*} \)
\(\begin{align*} \frac{\partial B_j}{\partial w_{i,j}} &= o_i o_j(1-o_j) (1-2o_j) \ \ \ \ \ \ (28) \\ \end{align*} \)
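As a quick sanity check of (28)'s inner factor (a hypothetical snippet; the names are mine), we can compare the closed form \(\frac{\partial B_j}{\partial s_j} = o_j(1-o_j)(1-2o_j)\) against a centered finite difference:

```python
import math

def theta(s):                 # logistic activation: o = θ(s)
    return 1.0 / (1.0 + math.exp(-s))

def B(s):                     # B = ∂o/∂s = o(1 - o)
    o = theta(s)
    return o * (1.0 - o)

s, h = 0.7, 1e-5
o = theta(s)
closed_form = o * (1.0 - o) * (1.0 - 2.0 * o)      # ∂B/∂s, per the derivation
finite_diff = (B(s + h) - B(s - h)) / (2.0 * h)    # numerical ∂B/∂s
print(closed_form, finite_diff)                    # agree to ~10 decimals
```

Multiplying by \(o_i\), per the chain-rule step above, then gives (28) itself.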
Next, see how \(\large \frac{\partial A_j}{\partial w_{i,j}} \) includes \(\large \frac{\partial A_j}{\partial o_j} \):

\(\begin{align*} \frac{\partial A_j}{\partial w_{i,j}} &= \frac{\partial A_j}{\partial o_j} \frac{\partial o_j}{\partial s_j} \frac{\partial s_j}{\partial w_{i,j}}\ \ \ (cf\ 25,CR) \\ &= \frac{\partial s_j}{\partial w_{i,j}} \frac{\partial o_j}{\partial s_j} \frac{\partial A_j}{\partial o_j} \\ &= C_j B_j \frac{\partial A_j}{\partial o_j} \\ &= o_i o_j(1-o_j) * \frac{\partial A_j}{\partial o_j} \ \ \ \ \ \ \ (29.1) \\ \end{align*} \)

This bit, \(\large A'_j ≜ \frac{\partial A_j}{\partial o_j}\), takes up the rest of this section -- a long page -- applying both the chain rule and the product rule, three times each.
\(\begin{align*} A'_j = \frac{\partial A_j}{\partial o_j} &= \frac{\partial}{\partial o_j} \sum_k A_k o_k (1-o_k) w_{j,k} \\ &= \sum_k \frac{\partial}{\partial o_j} w_{j,k} o_k (1-o_k) A_k \\ &= \sum_k T_k \ \ \ \ \ (29.2) \\ \end{align*} \)

Now the product rule gives us a term with \(\large \frac{\partial w_{j,k}}{\partial o_j}\), which will be convenient:
\(\begin{align*} T_k &= \frac{\partial}{\partial o_j} w_{j,k} o_k (1-o_k) A_k \\ &= \frac{\partial w_{j,k}}{\partial o_j} o_k (1-o_k) A_k + w_{j,k} \frac{\partial}{\partial o_j} o_k (1-o_k) A_k \ \ \ \ \ \ (PR) \end{align*} \)

Since \(w_{j,k}\) is not used in the forward calculation of \(o_j\), it is independent of \(o_j\), so \( \large \frac{\partial w_{j,k}}{\partial o_j} = 0 \) and the first term drops out, leaving:
\(\begin{align*} T_k &= w_{j,k} \frac{\partial}{\partial o_j} o_k (1-o_k) A_k \end{align*} \)

Applying the product rule a second time:
\(\begin{align*} T_k &= w_{j,k} [ (1-o_k) A_k \frac{\partial o_k}{\partial o_j} + o_k \frac{\partial (1-o_k)A_k}{\partial o_j} ] \ \ \ \ \ (30) \end{align*} \)

Since \(o_k\) is a function of \(s_k\), which is a function of \(o_j\), the chain rule helps us with the first term:
\(\begin{align*} \frac{\partial o_k}{\partial o_j} &= \frac{\partial o_k}{\partial s_k} \frac{\partial s_k}{\partial o_j} = B_k\ w_{j,k} = o_k(1-o_k)w_{j,k} \ \ \ \ \ (31) \end{align*} \)

And in the second term, the product rule gives us:
\(\begin{align*} \frac{\partial (1-o_k)A_k}{\partial o_j} &= (1-o_k)\frac{\partial A_k}{\partial o_j} + A_k\frac{\partial (1-o_k)}{\partial o_j} = (1-o_k)\frac{\partial A_k}{\partial o_j} - A_k\frac{\partial o_k}{\partial o_j} \ \ \ \ \ (32) \end{align*} \)

To obtain \( \large \frac{\partial A_k}{\partial o_j} \), notice \(A_k\) is a function of \(o_k\), which is a function of \(s_k\), which is a function of \(o_j\), so the chain rule applies nicely again:
\(\begin{align*} \frac{\partial A_k}{\partial o_j} &= \frac{\partial A_k}{\partial o_k} \frac{\partial o_k}{\partial s_k} \frac{\partial s_k}{\partial o_j} = A'_k B_k w_{j,k} = o_k (1-o_k) w_{j,k} A'_k \ \ \ \ \ (34) \\ \end{align*} \)

Node \(k\) sits at level \(l+1\) in the neural network, and we will have previously calculated \(\large A'_k = \frac{\partial A_k}{\partial o_k} \) for all the nodes at level \(l+1\) during the backward propagation phase.
With these ingredients we can continue from (30) to finish \(T_k\):
\(\begin{align*} T_k &= w_{j,k} [ (1-o_k) A_k \frac{\partial o_k}{\partial o_j} &&+ o_k \frac{\partial (1-o_k)A_k}{\partial o_j} ] \ \ \ \ \ (cf\ 30)\\ &= w_{j,k} [ (1-o_k) A_k \frac{\partial o_k}{\partial o_j} &&+ o_k [ (1-o_k)\frac{\partial A_k}{\partial o_j} - A_k\frac{\partial o_k}{\partial o_j}\ ]\ ] \ \ \ \ \ (cf\ 32)\\ &= w_{j,k} [ (1-o_k) A_k o_k(1-o_k)w_{j,k} &&+ o_k [ (1-o_k)\frac{\partial A_k}{\partial o_j} - A_k o_k(1-o_k)w_{j,k}\ ]\ ] \ \ \ \ \ (cf\ 31)\\ &= w_{j,k} [ (1-o_k) A_k o_k(1-o_k)w_{j,k} &&+ o_k [ (1-o_k)o_k(1-o_k)w_{j,k}\frac{\partial A_k}{\partial o_k} - A_k o_k(1-o_k)w_{j,k}\ ]\ ] \ \ \ \ \ (cf\ 34)\\ &= (w_{j,k})^2 o_k(1-o_k)\ [\ (1-o_k) A_k &&+ o_k [ (1-o_k)\frac{\partial A_k}{\partial o_k} - A_k\ ]\ ] \\ &= (w_{j,k})^2 o_k(1-o_k)\ [\ A_k (1-o_k) - o_k A_k &&+ o_k (1-o_k) A'_k\ ] \\ &= (w_{j,k})^2 o_k(1-o_k)\ [\ (1-2o_k) A_k &&+ o_k (1-o_k) A'_k\ ] \ \ \ \ \ (35) \\ \end{align*} \)

Combining (35) and (29.2) we get the key second-derivative term, \(A'_j\), which SLA backpropagation will successively calculate and save for each node, layer by layer, from the final node(s) all the way back to the inputs:
\(\begin{align*} A'_j = \frac{\partial A_j}{\partial o_j} = \sum_k T_k = \sum_k (w_{j,k})^2 o_k(1-o_k) [ (1-2o_k) A_k + o_k (1-o_k) A'_k\ ] = \sum_k (w_{j,k})^2 B_k [ (1-2o_k) A_k + B_k A'_k\ ] \ \ \ \ \ \ (36) \\ \end{align*} \)
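Here is a sketch of how the sum defining \(A_j\) (cf 26.0) and the new sum (36) might be computed for one node during the backward pass (all names are hypothetical), given the already-finished values at the successor level \(l+1\):

```python
# One backward-recursion step for a single node j (hypothetical sketch).
# w_jk[k] holds w_{j,k}; o, B, A, A1 map each successor node k at level l+1
# to its saved o_k, B_k, A_k, and A'_k values.

def node_backward(w_jk, o, B, A, A1):
    """Return (A_j, A'_j) per equations (26.0) and (36)."""
    A_j = sum(w_jk[k] * B[k] * A[k] for k in w_jk)              # (26.0)
    A1_j = sum(w_jk[k] ** 2 * B[k] *
               ((1.0 - 2.0 * o[k]) * A[k] + B[k] * A1[k])
               for k in w_jk)                                   # (36)
    return A_j, A1_j
```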
Returning to \(\large e''\) at (25), and rearranging, we derive the ponderous:

\( \large e'' = o_i \Big[ \underbrace{\frac{\partial B_j}{\partial w_{i,j}}}_{(28)} \cdot A_j + \underbrace{B_j\vphantom{\frac{\partial B_j}{\partial w_{i,j}}}}_{(24)} \cdot \underbrace{\frac{\partial A_j}{\partial w_{i,j}}}_{(29.1)} \Big] \)

Substituting \(\frac{\partial B_j}{\partial w_{i,j}} = o_i o_j (1-o_j) (1-2o_j)\) from (28), \(B_j = o_j (1-o_j)\) from (24), \(\frac{\partial A_j}{\partial w_{i,j}} = o_i o_j (1-o_j) \frac{\partial A_j}{\partial o_j}\) from (29.1), and \(\frac{\partial A_j}{\partial o_j} = \sum_k (w_{j,k})^2 B_k [ (1-2o_k) A_k + B_k A'_k\ ]\) from (36), we get:
\(\begin{align*} \large e'' = (o_i)^2 o_j (1-o_j)\ [\ (1-2o_j) A_j \ \ +\ \ o_j (1-o_j) \sum_k (w_{j,k})^2 B_k\ [\ (1-2o_k) A_k + B_k A'_k ]\ ] \\ \end{align*} \)

which can be simplified by precalculating \(B_j=o_j(1-o_j)\) after the forward pass, \(A_j\) recursively on the backward pass from the \(A_k\)'s per (25), and \(A'_j\) also recursively per (36):
\(\begin{align*} \large e'' = (o_i)^2 B_j [ (1-2o_j) A_j \ \ +\ \ B_j A'_j\ ]\ \ \ \ \ \ \ (37) \\ \end{align*} \)

Almost done: we define \(\Delta_{i,j}\) for each \(w_{i,j}\) as
\(\large \Delta_{i,j} ≜ \large \frac{e'_{i,j}}{e''_{i,j}}\)

which lets us combine (26.1) for \(e'\) and (37) for \(e''\):
\( \large \Delta_{i,j} = \frac{o_i B_j \sum_{k}(w_{j,k} B_k A_k)}{(o_i)^2 B_j [ (1-2o_j) A_j \ \ +\ \ B_j \frac{\partial A_j}{\partial o_j} ]} \)

which simplifies, cancelling \(o_i B_j\) and replacing the upper sum \( \sum_{k}w_{j,k}B_k A_k\) with \(A_j\) (per 26.0), so that:
\( \large \Delta_{i,j} = \frac{A_j} {o_i [ (1-2o_j) A_j \ \ +\ \ B_j \frac{\partial A_j}{\partial o_j} ]} \)

This brings us finally to the Successive Linear Approximation, a.k.a. Newton-Raphson, Backpropagation Update Rule:
\(\large \hat w_{i,j} = w_{i,j} - \Delta_{i,j} = w_{i,j} - \frac{A_j}{o_i [ (1-2o_j) A_j \ \ +\ \ B_j \frac{\partial A_j}{\partial o_j} ]} \)

which is not only simpler to read, but now possible to calculate, given the forward-pass values \( o_j, B_j\) and the backward-pass values \( A_j, A'_j \).
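A small cross-check (with made-up cached values, purely illustrative): the simplified \(\Delta_{i,j}\) in the update rule should equal the raw ratio \(e'/e''\) assembled from (26.1) and (37):

```python
# Hypothetical cached values for one weight w_{i,j}:
o_i, o_j, A_j, A1_j = 0.6, 0.3, 0.8, 0.5
B_j = o_j * (1.0 - o_j)

e1 = o_i * B_j * A_j                                          # e'  per (26.1)
e2 = o_i ** 2 * B_j * ((1.0 - 2.0 * o_j) * A_j + B_j * A1_j)  # e'' per (37)

delta_raw = e1 / e2
delta_simplified = A_j / (o_i * ((1.0 - 2.0 * o_j) * A_j + B_j * A1_j))
print(delta_raw, delta_simplified)   # identical: the o_i * B_j factor cancels
```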
It will save some steps if we calculate most of this before we get to the predecessor node \(i\). Let's define \(\Delta_j\) as:
\(\large \Delta_{j} ≜ \frac{A_j}{[(1-2o_j) A_j \ \ +\ \ B_j A'_j\ ]}\)

so that
\(\large \Delta_{i,j} = \frac{1}{o_i} \Delta_{j}\)

We can now state our SLA Backpropagation Algorithm (a runnable sketch follows the outline):
Forward Base Case:
Same as previously given.

Forward Recursion:
Same as previously given.
Save \(o_j, B_j\) for all nodes.
(Don't keep the \(s_j\)'s once the \(o_j\)'s are done.)
(If a subtract-and-multiply is cheaper than a retrieval from memory, calculate the \(B\)'s as needed below.)

Backward Base Case:
For each node \(j\) in level \(L\) (i.e., final layer, output node(s))
\(A_j = \frac{\partial e}{\partial \hat z} = \hat z - z\) (cf 15)
\(A_j' = \frac{\partial^2 e}{(\partial \hat z)^2} = 1\) (cf 15.2)

Backward Recursion:
For each level \(l\) from \(L-1\) down to \(0\)
For each node \(j\) in level \(l\)
\(B_j = o_j(1-o_j)\) (skip if saved above, but save anyway for the next layer)
\(A_j = \sum_{k}w_{j,k} B_k A_k \ \ \ \ \ \ (cf\ 26.0)\)
\( A_j' = \sum_{k} (w_{j,k})^2 B_k [ (1-2o_k)A_k + B_k A_k'\ ] \ \ \ \ \ \ (cf\ 36) \\ \)
\( \large \Delta_{j} = \large \frac{A_j}{[\ (1-2o_j)A_j + B_j A'_j\ ]} \)
For each node \(i\) in level \(l-1\)
\( \large \Delta_{i,j} = \large \frac{\Delta_j}{o_i} \)*
* Use += instead of = to accumulate adjustments across multiple observations. (Reset to zero after updates.)
Repeat the above for any subset(s) of the training data you want to learn from in a single generation.

Update Rule:
For each level \(l\) from \(L-1\) down to \(0\)
For each node \(j\) in level \(l\)
For each node \(i\) in level \(l-1\)
\( \large \hat w_{i,j} = w_{i,j} - \Delta_{i,j} \)
Repeat generations until the adjustments get as small as you like.
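Finally, a compact runnable sketch of the whole algorithm above (my own naming and structure, not the text's: logistic activations, a fully-connected net without bias terms, and no guard against near-zero \(o_i\) or a vanishing denominator, which a production version would need):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def sla_generation(Ws, batch):
    """One generation of SLA backpropagation.
    Ws[l] has shape (n_l, n_{l+1}); Ws[l][i, j] is w_{i,j} from node i
    in level l to node j in level l+1. Updates Ws in place."""
    deltas = [np.zeros_like(W) for W in Ws]      # accumulated Delta_{i,j}
    for x, z in batch:
        # Forward pass: save o_j and B_j = o_j(1 - o_j) for every level.
        o = [np.asarray(x, dtype=float)]
        for W in Ws:
            o.append(sigmoid(o[-1] @ W))
        B = [layer * (1.0 - layer) for layer in o]
        # Backward base case at the output: A = z_hat - z, A' = 1.
        A = o[-1] - z
        A1 = np.ones_like(A)
        for l in range(len(Ws) - 1, -1, -1):
            # Here A, A1, o[l+1], B[l+1] all describe level l+1.
            Dj = A / ((1.0 - 2.0 * o[l + 1]) * A + B[l + 1] * A1)
            deltas[l] += Dj[None, :] / o[l][:, None]  # Delta_{i,j} = Dj / o_i
            if l > 0:  # recurse A and A' down to level l, per (26.0) and (36)
                A, A1 = (Ws[l] @ (B[l + 1] * A),
                         (Ws[l] ** 2) @ (B[l + 1] *
                             ((1.0 - 2.0 * o[l + 1]) * A + B[l + 1] * A1)))
    for W, D in zip(Ws, deltas):     # update rule: w_hat = w - Delta
        W -= D

# Hypothetical usage on toy data (inputs kept away from 0 so o_i != 0):
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(2, 3)), rng.normal(size=(3, 1))]
data = [(np.array([a, b]), np.array([float(a > b)]))
        for a in (0.1, 0.9) for b in (0.1, 0.9)]
for generation in range(20):
    sla_generation(Ws, data)
```

Note the unguarded Newton step: wherever the denominator \((1-2o_j)A_j + B_j A'_j\) passes near zero, \(\Delta_j\) blows up, so a practical version would clip or damp \(\Delta\); the outline above leaves that to the reader.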
Now you have trained your Neural Network. Tadaa!
I hope you will find this process to require many fewer iterations than Gradient Descent. Let me know!
For more, have a look at Fuzzy Logic NNs and Robot Control.