Given high school algebra and a little trigonometry and the derivative rules, plus some patience, you can learn here the ideas and math behind neural networks, plus improvements and intuitions about them which I haven't seen elsewhere. Please take notes, and enjoy the ride, and the view at the top!
Equality is indicated horizontally by =, vertically by ├─┤ (pronounced "H"), definitionally by ≜, and horizontally between expressions or equations by ⌶.
I especially hope you will use H Notation which considerably condenses and simplifies homework and mathematical (and linguistic) derivations. Reduce the pointless rewriting of long equations, and replace the sloppy scribbling of brackets and braces and circles and arrows. H notation provides a tidy, specific, precise tool to make your meaning clear and exact. It says the above part is equal to, and can be replaced by, this other part below. Please use H Notation, to help yourself as well as your readers and students to easily follow even complex derivations.
More:
Evaluation: \(\big|^c\) indicates evaluation under \(c\), one of multiple conditions.
Elsewhere: a bare \(\big|\), with no condition, indicates the "elsewhere" or "otherwise" condition. For example, \(max(a,b)= \big|^{a>b} a,\ \ \big|\) \(b\). Sweet and simple, and fun to draw.
Last, function composition is normally represented pair-wise, by ∘; let me add ⦾ for list-wise composition.
(1) f ∘ g (x) ≜ g(f(x)).
Both evaluation orders, as-read-first and nearest-first order (inside out), are used and found intuitive by different people. To choose explicitly, let us use reading order, taking inner to outer as being represented on the left and right sides of the ∘ symbol, respectively. Also, let composition take associative priority over function application, so that: f ∘ g (x) = (f ∘ g)(x). I will let ⦾ represent (ordered) function composition of its argument list, like ∑ and ∏ represent (unordered) summation and product.
These list-wise operators are one way to demonstrate that some kind of truth, some actual generalization, has been found, because instead of an enumeration of cases, one must have found the generating expression which captures all the cases. It is important to capture an idea, insight, or observation at the correct level of abstraction (an enumeration is less abstract, a generating expression more so), because otherwise you haven't really understood what you think you know.

Given \(B\) functions \(\alpha_b\), \((0\le b\lt B)\), define a composition C:

(2.1) \( \large C \triangleq \underset{\large b}{\Huge ⦾}\ \alpha_b \)

indicating

(2.2) \( C(x) = \alpha_0 \circ \alpha_1 \circ \alpha_2 \circ \dots \circ \alpha_{B-1}\ (x) = \alpha_{B-1}( \dots \alpha_2( \alpha_1( \alpha_0(x))) \dots ) \)

Using ⦾ we can define an entire neural network in four symbols: ⦾ W ∘ \(\theta\).
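If you'd like to see ⦾ as code, here is a minimal Python sketch (the names compose, add1, and double are my own, for illustration):

```python
from functools import reduce

# List-wise composition in the reading-order convention of (1):
# compose(f, g)(x) == g(f(x)).
def compose(*fs):
    return lambda x: reduce(lambda acc, f: f(acc), fs, x)

add1 = lambda x: x + 1
double = lambda x: 2 * x
assert compose(add1, double)(3) == 8   # double(add1(3))
```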
So I'll often let you figure out the indexes for the argument list. The index range may be inferrable from context, as in \(\sum_e\), or given by set membership rather than a sequence of numbers. For example, \(\large\sum_{e \in S}\) lets you enumerate later (if you ever need to write it as code): just get all the values \(e\) in the set \(S\). Too much obsession about the indexes when you're trying to capture the simple essence of something is distracting, and defeats the weak mind of this author.
Generally, zero-based enumeration, 0...N-1, for the N elements in a list, does help clear thinking, as I learned from the great George Mou's Yale PhD thesis, A Formal Model for Parallel Processing. He uses index-renumbering functions which take a subset of indexes of a larger set, and renumbers them from 0 to N-1, so that within the discussion of the subset you don't have to keep track of where you are in the superset. Then you can refer to the indexes of members of a subset as counting from 0 to N-1. Easy!
Also, I find it quite helpful to use variable names that are evocative and mnemonic: \(r\) for row number, \(a\) to \(z\) represent inputs to outputs, \(l\) for layer, \(w\) for weight, \(E\) and \(e\) for error, \(s\) for sum, \(o\) for (node) output, \(C\) for a composition. Also, I associate in my mind \(i\), \(j\), and \(k\) with nodes in the previous, current, and next layers of the network, respectively, so \(j\) refers to any node in the current level \(l\), and \(\large \sum_i\) refers to all nodes \(i\) in level \(l-1\), etc. It's easier to think about complex matters when the variable names are more meaningful. Still, some variable names merely came to hand when needed: \(\alpha\), \(B\), \(A\), \(F\), \(T\).
Let \(C^n\) be the nth derivative of \(C\), \( \alpha^n_b \) the nth derivative of \( \alpha_b\), etc. Hence \(C^0 = C\).
In some cases where n is 1 or 2, I will also write \(\large e'\) for \(\large \frac{\partial e}{\partial w_{i,j}}\), \(\large e''\) for \(\large \frac{\partial^2 e}{(\partial w_{i,j})^2}\), and A' for \(\large \frac{\partial A}{\partial o_s}\).
(10) \( \large \frac{d}{dx} g(f(x)) = \frac{\partial g}{\partial f} * \frac{df}{dx} \)

Repeatedly applying the chain rule to a repeated composition C (setting \(\alpha_{-1} \triangleq x\)) yields its first derivative \(C^1\):

(11) \( \large C^1(x) \triangleq \frac{d}{dx} C(x) = \frac{d}{dx} \underset{\large b}{\Huge ⦾}\ \alpha_b (x) = \prod_b \frac{\partial \alpha_{b}}{\partial \alpha_{b-1}} (x) \)
Or in fewer symbols:
(3) \( \large C^1 = \big( \underset{\large b}{\Huge ⦾}\ \alpha_b \big)^1 = \prod_b \alpha_b^1 \ \ \ \ \ \ \ (CR) \)
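A quick numeric spot-check of (CR) for B = 2, with invented α's; note that each factor is evaluated at the output of the stage before it, which the compact notation leaves implicit:

```python
import math

# C = α0 then α1 (reading order), with α_{-1} ≜ x.
alpha0, dalpha0 = math.sin, math.cos
alpha1 = lambda u: u ** 2
dalpha1 = lambda u: 2 * u

C = lambda x: alpha1(alpha0(x))
# (11): C'(x) = α1'(α0(x)) * α0'(x)
C1 = lambda x: dalpha1(alpha0(x)) * dalpha0(x)

x, h = 0.7, 1e-6
numeric = (C(x + h) - C(x - h)) / (2 * h)   # finite-difference estimate
assert abs(C1(x) - numeric) < 1e-6
```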
Let's be careful about what we can do with graphs here, and let's consider the cases for a focus node B. Our reasoning should apply to any node, with zero, one or more predecessors, and zero, one or more successors.
The function composition relationship is that the output of the inner function is the input of the outer function. A feeds B. B(A(x)) is the composition of B over A.
Call the predecessor P and the successor S. If P's input is x and S's output is y, then y=S(P(x)). The chain rule asserts further that one may separately differentiate, then multiply, the parts of a composition. Holding constant any other inputs to P:
\(\large \frac{dy}{dx} = \frac{dS}{dx} = \frac{dS}{dP} * \frac{dP}{dx}\)

This lets us build a (so far non-branching) graph of forward links, in which the parts are separately differentiable, and where multiplying the derivatives gives the cumulative derivative across a subsequence of links.
For backward calculation of derivatives we hold all the inputs constant but one, and differentiate with respect to that one, using the normal chain rule method. And if we can do one, we can do them all the same way.
But generally we can get multiple outputs by using multiple nodes, each with a single output. So much for multiple outputs.
But consider multiple output links from a node. These either represent multiple outputs, discussed above, or a single output from a node, taken as the input to multiple successors. In the latter case, forward calculations take the single output of a node and just copy or share it to be fed as input to each of perhaps many successors: simply a no-op. We have handled all the easy cases; the one which remains, the backward, differentiation process, is where we have a problem.
Suppose we are doing backward propagation of the differentiation process, layer by layer across the network from the end toward the beginning. Suppose we encounter a node with multiple successors. We have to take the derivative of a forward-expanding node. Let's say we have node A with its own input x, and feeding at least two successors, say B and C. Locally, B(A(x)) and C(A(x)) are the local compositions of functions. Differentiating backward, we might guess that the derivative of multiple successors is the sum of multiple derivatives. This is consistent with the idea of Total Derivative, in which if \(z = f(x(t),y(t))\), then the total derivative
\(\large \frac{dz}{dt} = \frac{df}{dt} = \frac{df}{dx} \frac{dx}{dt} +\frac{df}{dy} \frac{dy}{dt}\).

In English, the derivative from multiple successor paths is the sum of the chained derivatives along each successor path.
By what intuition should this be the case for neural networks? We seek to find how an error at the end of the forward calculation varies with respect to the different parameters of the model: \(\large \frac{de}{dp}\) for some parameter \(p\). One node's output is \(t\); it is taken as input by multiple nodes such as \(x\) and \(y\); in the end the system output is a function of both those paths, \(f()\), and we want to know what the derivative of the error is with respect to \(t\), when \(t\) feeds into the error by multiple different functions of \(t\).
Another way to think of it is, if \(t\) changes it drives a change in \(e\) via the path \(x\), and drives a further separately calculable change in \(e\) via the path \(y\), and we might approximate the joint effects through both paths \(x\) and \(y\) on \(e\) as the sum of the two effects through the two separate paths. It might be an overestimate, but if it approaches zero effect via each path separately then we might expect that the joint effect would also tend toward zero. Anyway on that intuition and the theory of Total Derivatives, we can proceed.
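Here is a numeric check of that intuition on an invented two-path example, comparing the sum of the chained path derivatives against a finite-difference estimate:

```python
# z = f(x(t), y(t)) with f(x, y) = x * y, x(t) = t**2, y(t) = t + 3, so
# dz/dt = (df/dx)(dx/dt) + (df/dy)(dy/dt) = y * 2t + x * 1.
x = lambda t: t ** 2
y = lambda t: t + 3
z = lambda t: x(t) * y(t)

t, h = 1.5, 1e-6
two_path_sum = y(t) * 2 * t + x(t) * 1.0       # sum over the two paths
numeric = (z(t + h) - z(t - h)) / (2 * h)      # direct estimate of dz/dt
assert abs(two_path_sum - numeric) < 1e-6
```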
All Hail the Chain Rule!
This identity between flow of information across nodes in a graph and function composition is the key idea in neural networks. With the chain rule we can now calculate not just outputs layer by layer, but also derivatives layer by layer, which is what lets us train them, by seeking those zero-slope parameter values where they can't be improved any more. Great. Remember the chain rule.
A neural network involves the elements {\( n, L, W, \theta, e \)}, which the following sections define.
We are given a train/test dataset R (Rows), mapping from a to z, for \(0\le r\lt|R|\)
(4) R[r] ≜ (\(a_r\), \(z_r\)), so that each observation (row) is an observed input \(a_r\), associated with an observed (target) output \(z_r\).
f's output, which we hope will be close to \(z_r\), is calculated as the estimate \(\hat z_r\), as follows:
A Neural Network is a function f: \( \newcommand{\COMPOP}{\Huge ⦾} \) $$\large f = \underset{\large l}{\COMPOP} \large W_l \circ \theta $$
considering each \(W_l\) as its matrix-multiply function. In short, a neural network is a composition of weightings and thresholdings, layer by layer, from the inputs to the final output.
(5) \( \large \hat z_r \triangleq f(a_r) \triangleq \underset{\large l=0}{\overset{\large L}{\Huge ⦾}} W_l \circ \theta\ (a_r) \)
Concise is nice, but clear is best. So let's restate with more concreteness and perhaps clarity.
By definition of ⦾,
(6) \( \hat z_r = W_0 \circ \theta \circ W_1 \circ \theta \circ \dots \circ W_L \circ \theta\ (a_r) = \theta(W_L( \dots \theta(W_1(\theta(W_0(a_r)))) \dots )) \)
(I won't mention matrices again.)
This composition of layers in Neural Networks lets us calculate f by proceeding layer by layer across the network, calculating intermediate variables \(s_j\) (sum) and \(o_j\) (output) for each layer l, as follows:
(7) \(s_j \triangleq \sum_i w_{i,j} * o_i\) ("sum")

and

(8) \(o_j \triangleq \theta(s_j)\) ("output")

Composition of layers also justifies using the chain rule of differentiation to calculate derivatives with respect to the \(w_{i,j}\) ∈ W, also layer by layer, but from the last layer toward the first.
All these are intuitive, but my contribution here is the gas gauge or
"meter symbol", which I find especially helpful in drawing out
dependencies so I can visualize how to apply the chain rule
correctly. A meter symbol correctly suggests a range from empty to
full, 0 to 100%; it helpfully separates the sum s from the
output o, and it even looks a little bit like the θ that
it represents.
If you like this kind of view, then behold, for your mere enjoyment, a completely imaginary 2-hidden-layer binary classifier, where ẑ>0.5 means Yes, otherwise No:
In this invented machine each system-input ar is a vector
of three input values, which are directly made over to be the outputs of
three nodes in layer 0. Each (subsequent) layer
calculates \(s_j\) using equation (7) then \(o_j\) using equation (8),
for each node in that layer.
As the designer you get to pick which nodes have weights between them, and which don't. Fully connected layers, between adjacent layers, is just one design. Whatever, the algorithms don't care. The action is all in the weights that are not zeroes. Any zero weight will stay zero, is all you need to know.
To keep track of the calculations in the giant potential mess this can become, where subscripts and indices and layers are variables that increment across the network, it helps me to have a completely generic instance in mind, with established index names which have meaning in this generic case. Then I can minimize the mental effort of getting the indexes right, which can add so much noise as to limit clear thinking. Generic, but not chaotic, helps me achieve clarity.
To visualize the information flows, I apply these graphical conventions together in the simplest, but fully general, case of a single arbitrary node, j, at an arbitrary interior level, \(0\lt l\lt L\). I am being careful to say nothing that reduces the generality of this picture, so that we can apply it to every node in the network, and therefore we do actually understand the entire process comprehensively, at least from the bottom up, and from one end to the other and back again.
So: Consider an arbitrary node, j, in an arbitrary level, l, along with an arbitrary predecessor of j, numbered i, in level l-1, and an arbitrary successor of j, numbered k, in level l+1. In my notation here, k's live on level l+1, so for example \(\sum_k\) means a sum over all the *successors* of the given j. The forest of weights is hidden except for single weights in and out of node j, by way of example.
The last layer's output value \(o_z\) is the function output f(ā) = ẑ.
There may be one or more nodes in the final layer, producing a vector or list for ẑ, which makes no difference except that your target outputs must also be organized the same way, with the same number of elements in z as the final layer has nodes, and that you will end up with different weights for the different end nodes. You will calculate -- using the same methods -- up to multiple endpoints. If you can do it for one, you can do it for them all.

So the essence is simple: Multiply, sum, threshold. Applying this across a whole network is the entire concept.
So now also in the visual domain, you have the idea. That's it!
What's left is to put this idea that you already know into explicit instructions that you could write code from, and then to understand how to train these things.
(9) f(\(a_r\)) = ẑ ≜ \(o_j\), j ∈ \(L_L\), emerges after iterating over layers from \(L_0\) to \(L_L\).
At each layer, we calculate intermediate results \(s_j\) and \(o_j\) for each j ∈ Ll.
The sequence is:
Base case: For j ∈ \(L_0\), set \(o_j = a_r[j]\) (the inputs).

Recursion: For l ≔ 1 to L:

For j ∈ \(L_l\):

\(s_j = \sum_{i \in L_{l-1}} o_i w_{i,j}\) and

\(o_j = \theta(s_j)\)

At the end, \(f(a_r) = \hat z_r = o_j\) where \(j \in L_{L}\).
Retain with each node the local value of \( o_j\), since it will also be used later in training.
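To make the recursion concrete, here is a minimal NumPy sketch of the forward pass; the function names, layer sizes, and random weights are invented for illustration, with \(w_{i,j}\) stored as weights[l][i, j]:

```python
import numpy as np

def theta(s):
    """Sigmoid threshold, per (22)."""
    return 1.0 / (1.0 + np.exp(-s))

def forward(a_r, weights):
    """Forward pass per (7) and (8). Returns the o's of every layer,
    retained because they are used later in training."""
    outputs = [np.asarray(a_r, dtype=float)]   # base case: o_j = a_r[j]
    for W in weights:                          # recursion over layers
        s = W.T @ outputs[-1]                  # (7): s_j = sum_i w_{i,j} o_i
        outputs.append(theta(s))               # (8): o_j = theta(s_j)
    return outputs

# An invented 3-input network with two hidden layers and one output:
rng = np.random.default_rng(0)
ws = [rng.standard_normal((3, 4)),
      rng.standard_normal((4, 4)),
      rng.standard_normal((4, 1))]
z_hat = forward([0.2, 0.5, 0.9], ws)[-1]       # (9): the final o is ẑ
```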
Standard in NN training is to use (11) (with b = l and \(\alpha_b = W_l \circ \theta\)). (11) is the reason neural networks work. But it's slow: we adjust the weights using a fixed multiple of the direction \(\large e'\), instead of using Newton-Raphson, which needs the second derivative \(C^2\), and which would let us at least do a linear approximation for an actual reasoned next estimate of a minimum-error setting of the weights. That ought to be many times faster -- depending on your step size -- as well as more precise.
A discussion of why Newton-Raphson should normally be much better than gradient descent, along both a geometrical and a mathematical derivation of the Newton Update, is here.
To be better, we must improve it, and the improveable part of a neural network is its weights. To improve the weights to get a better, trained, neural network model, we must adjust them from their original, possibly random or uniform but anyway far-from-ideal, untrained values.
How do we adjust them? In the direction of reducing errors. So, direction is the key concept. And we figure out what direction by using derivatives.
So we seek to minimize total error, E:
(13) E ≜ \(\sum_r e_r\).

With reference to any training row R[r] = (\(a_r\), \(z_r\)) (leaving subscripts r unwritten), let:
(14) e ≜ \( \frac{1}{2}\)(ẑ − z)\(^2\), whence:
(15) \( \large \frac{\partial e}{\partial \hat z} \) \( \large = \frac{\partial }{\partial \hat z} \frac{1}{2} (\hat z - z)^{2} \) \( \large = \frac{2}{2}(\hat z - z)^1 \frac{\partial }{\partial \hat z} (\hat z-z) \) \( \large = (\hat z - z)*1 \) \( \large = \hat z - z \)
(15.2) \( \large \frac{\partial^2 e}{(\partial \hat z)^2} \) \( \large = \frac{\partial }{\partial \hat z}(\hat z - z) \) \( \large = 1 \)

We can add these up across some or all the training examples, r, because the derivative of a sum is the sum of the derivatives: dE = d(∑e) = ∑de.

Now (15) is the error derivative with respect to the output ẑ. That's the beginning of the story (starting from the end), but now we want to find the slope or derivative \( \large e' \) of the error with respect to each weight, and adjust the weight by a negative multiple of that derivative (negative in order to move downslope, to reduce the error).
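For concreteness, a tiny worked instance of (14), (15), and (15.2), with invented values:

```python
# Invented values: target z = 1.0, current estimate z_hat = 0.8.
z, z_hat = 1.0, 0.8
e = 0.5 * (z_hat - z) ** 2   # (14): e = 0.02
de = z_hat - z               # (15): -0.2 (negative: raising z_hat lowers e)
d2e = 1.0                    # (15.2): constant curvature
```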
Gradient Descent:

(16) \(e'\) ≜ \( \large \frac{\partial e}{\partial w_{i,j}} \)

(17) \( \large \hat w_{i,j} = w_{i,j}-\delta e' \) (Gradient descent update rule)

This turns out equal to the Newton-Raphson update rule if we specify \(\large \delta = \frac{1}{e''}\). So we will seek not just \(\large e'\), for gradient descent in the right direction, but also \(\large e''\) to estimate the right distance to go.
Newton-Raphson:

(18) \(\large e''\) ≜ \( \large \frac{\partial ^2e}{(\partial w_{i,j})^2} \)

(19) \( \large \hat w_{i,j} = w_{i,j}-\frac{e'}{e''} \) (Newton-Raphson update rule)

Our task here will be to derive these derivatives. Haha!
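A toy one-dimensional sketch, on an invented quadratic error, of why the Newton-Raphson step ought to be faster:

```python
# Minimize e(w) = (w - 2)**2, whose minimum is at w = 2.
# Here e' = 2(w - 2) and e'' = 2.
w_gd, w_nr = 5.0, 5.0
delta = 0.1                              # fixed step for gradient descent
for _ in range(10):
    w_gd -= delta * 2 * (w_gd - 2)       # (17): shrinks the gap by 20% per step
    w_nr -= (2 * (w_nr - 2)) / 2.0       # (19): lands exactly on w = 2 in one step
print(w_gd, w_nr)                        # GD still approaching 2; Newton at 2
```

On a quadratic the Newton step is exact; on a general error surface it is a reasoned estimate rather than a fixed-size nudge.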
There will be a ton of them, \(e'_{i,j}, e''_{i,j}\), one for each weight \(w_{i,j}\), but we will calculate them all, and could update all the weights, in a single backward pass.
\(\large A_j, B_j, C_j\) are derived as follows (taking easiest first):
(20) \( \large e'_{l,i,j} = \frac{\partial e}{\partial w_{i,j}} = \underbrace{\frac{\partial e}{\partial o_j}}_{\large A_j} * \underbrace{\frac{\partial o_j}{\partial s_j}}_{\large B_j} * \underbrace{\frac{\partial s_j}{\partial w_{i,j}}}_{\large C_j} \)

where, as derived below, \(C_j = o_i\), \(B_j = o_j(1-o_j)\), and \(A_j\) requires a recursion.
\( \begin{align*} \large C_j = \frac{\partial s_j}{\partial w_{i,j}} = \frac{\partial}{\partial w_{i,j}} \sum_{i'} w_{i',j}o_{i'} = \sum_{i'} o_i|_{i'=i}, 0|_{i'\ne i} = o_i \ \ \ \ \ \ (21) \end{align*}\)
The threshold function θ is the logistic sigmoid,

\(\large θ(s)=\frac{1}{1+e^{-s}} \ \ \ \ \ \ (22) \)

and it has a convenient derivative
\(\large \frac{\partial θ(s)}{\partial s} = θ(s)(1-θ(s)) \ \ \ \ \ \ \ (23) \)giving us for \(\large B_j\):
\( \begin{align*} \large B_j = \frac{\partial o_j}{\partial s_j} = \frac{\partial θ(s_j)}{\partial s_j} = θ(s_j)(1-θ(s_j)) = o_j(1-o_j) \ \ \ \ \ (24) \\ \end{align*} \)

Our formulas will make much use of \( B_j = \frac{\partial o_j}{\partial s_j} = o_j(1-o_j) \).
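A quick numeric spot-check of (23), at an arbitrarily chosen point:

```python
import math

theta = lambda s: 1.0 / (1.0 + math.exp(-s))
s, h = 0.3, 1e-5
numeric = (theta(s + h) - theta(s - h)) / (2 * h)     # finite difference
assert abs(theta(s) * (1 - theta(s)) - numeric) < 1e-8  # (23)
```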
Now \(\large A_j\) is the key to backpropagation; it is defined first at the end for final node(s), then at each preceding level it is built up using its previously calculated values from the next later level. The base case of this recursion is with j final:
\( \large A_j = \frac{\partial e}{\partial o_j} = \frac{\partial e}{\partial \hat z} = \hat z - z \ \ \ \ \ (cf\ 15) \)

Then for all the non-final cases, the recursion must add up the effects from each successor to j, that is, for each k ∈ \(L_{l+1}\) a successor of j (we always assume j ∈ \(L_l\)), as follows:

(25) \(\large A_j = \frac{\partial e}{\partial o_j} = \sum_{k} \underbrace{\frac{\partial e}{\partial o_k}}_{\large A_k} * \underbrace{\frac{\partial o_k}{\partial s_k}}_{\large B_k} * \underbrace{\frac{\partial s_k}{\partial o_j}}_{\large F} \) (by the chain rule)

Where:

\( \large A_k\): was already calculated in layer l+1.

\( \large B_k = o_k(1-o_k)\): just as for \(\large B_j\).

\( \large F = \frac{\partial s_k}{\partial o_j} = \frac{\partial}{\partial o_j} \sum_{j'} o_{j'} w_{j',k} = \sum_{j'} w_{j,k}\big|^{j'=j},\ 0\big|^{j'\ne j} = w_{j,k} \)

Hence
\(\begin{align*} \large A_j = \frac{\partial e}{\partial o_j} &= \sum_{k} F*B_k*A_k = \sum_{k} w_{j,k}\ o_k (1-o_k) \frac{\partial e}{\partial o_k} \\ &= \sum_{k} w_{j,k}\ o_k (1-o_k) A_k \ \ \ \ \ \ \ (26.0) \\ \end{align*}\)

Substituting back into (20):
\(\begin{align*} \large e'_{i,j} &= \frac{\partial e}{\partial w_{i,j}} = \large C_j B_j A_j = C_j B_j \frac{\partial e}{\partial o_j} = C_j B_j \sum_{k} (F B_k A_k) \\ \large e' &= \large o_i o_j (1-o_j) \sum_{k}(w_{j,k} o_k (1-o_k) A_k) \ \ \ \ \ \ \ \ (26.1) \\ \end{align*} \)
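Putting (26.0) and (26.1) into code: a minimal NumPy sketch continuing the forward() example above (again my own names and structure; one training row, squared error per (14)):

```python
import numpy as np

def backprop(outputs, weights, z):
    """Per-weight gradients e'_{i,j} via the A, B, C factoring of (20),
    using the recursion (26.0) and the product (26.1)."""
    grads = []
    A = outputs[-1] - z                     # base case (cf 15): A = ẑ - z
    for l in range(len(weights) - 1, -1, -1):
        o_j, o_i = outputs[l + 1], outputs[l]
        B = o_j * (1 - o_j)                 # (24): B_j = o_j (1 - o_j)
        grads.append(np.outer(o_i, B * A))  # (26.1): e'_{i,j} = o_i B_j A_j
        A = weights[l] @ (B * A)            # (26.0): A_j = sum_k w_{j,k} B_k A_k
    return grads[::-1]                      # ordered from first layer to last

# Continuing the invented example: one gradient-descent step per (17).
z_r = np.array([1.0])                       # invented target
grads = backprop(forward([0.2, 0.5, 0.9], ws), ws, z_r)
ws = [W - 0.1 * g for W, g in zip(ws, grads)]
```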
We build the computation graph according to the following types of flows, with each node's intermediate values of s's and o's, etc., each making use of previous calculations.
Feed-forward: a → o → s → ẑ

Back-propagate: → \(A_k \in L_l\) → \(A_j \in L_{l-1}\) → ... → \(A_{L_0}\), and → \(\large e'_{l,i,j} \) → \(w_{i,j}\)
Parallel computation can be achieved by dividing the training data according to the number of available processors and separately calculating \(\large \Delta_{i,j}\)'s or \(\frac{\partial e_r}{\partial w_{i,j}}\) for each \(e_r\), and then combining the results when completed by summing the Δ's across training data subsets.
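A sketch of that data-parallel scheme, building on the forward and backprop sketches above; the rows and the shard count are invented:

```python
import numpy as np

rows = [([0.2, 0.5, 0.9], np.array([1.0])),    # invented (a_r, z_r) pairs
        ([0.1, 0.4, 0.8], np.array([0.0]))]

def grad_sum(rows, weights):
    """Sum of per-row gradients over a subset of the training data."""
    total = [np.zeros_like(W) for W in weights]
    for a_r, z_r in rows:
        for t, g in zip(total, backprop(forward(a_r, weights), weights, z_r)):
            t += g
    return total

n_procs = 2
shards = [rows[p::n_procs] for p in range(n_procs)]   # split the data
partials = [grad_sum(shard, ws) for shard in shards]  # each could run on its own processor
combined = [sum(gs) for gs in zip(*partials)]         # Δ's sum across subsets
```

Summing the partial results afterward gives the same answer as one sequential pass, because the derivative of a sum is the sum of the derivatives.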
Alternatively, processors can be assigned to subsets of the network, to make feedforward or backward calculations, as the data becomes available to their respective subset of the network. The results must be made available to the successor processors, keeping results separate for each observation. Then inputs can be processed both forward and backward not so much simultaneously as overlappingly, in pipelined, first-in-first-out fashion.
(Formal analysis of the algorithm will benefit from the concepts in George Mou's 1991 Yale CS thesis on parallel computation, for example the divide/combine index operations.)
Proceeding layer-by-layer, this adjustment indicator is propagated from the final layer all the way to the start layer, leaving behind evidence, that is, de/d\(o_j\), at each node j, so as to calculate the new value for \(w_{i,j}\). The update process follows the existing network structure: first as to computation flow; second by leaving the zero weights unchanged. Thus the network shape remains unchanged, the number of nodes at each layer remains unchanged; only the weights adjust.
The updates are scaled:
0 ≤ \(o_j\) ≤ 1, from which it follows that
0 ≤ \(o_j\)2 ≤ \(o_j\), and
0 ≤ \(o_j\)(1−\(o_j\)) = \(o_j\)−\(o_j\)2 ≤ \(o_j\).
It always helps to contemplate a graph of the relevant functions.
Define three degrees of uncertainty to apply to a passed-down error-derivative signal, as follows:
SortOf(j) ≜ \(o_j\)

SortOf ∘ SortOf(j) ≜ \(o_j^2\)

Maybe(j) ≜ \(B_j\) = \(o_j - o_j^2\) = SortOf(j) − SortOf(SortOf(j))

UpError(k) ≜ \(de/do_k\)
de/d\(w_{i,j}\) = de/d\(o_j\) * d\(o_j\)/d\(s_j\) * d\(s_j\)/d\(w_{i,j}\) = A * B * C, where, as in (20):

A = \(\sum_k w_{j,k}\) * Maybe(k) * UpError(k)

B = Maybe(j) = \(o_j(1-o_j)\)

C = SortOf(i) = \(o_i\)
Thus the update signal Δ at a node j at level l is a diluted sum of the backpropagating error signals UpError(k) from successor nodes at level l+1, scaled up by the wj,k weights, but diluted by enhanced uncertainties on the links backward from different nodes k to the node j, and by enhanced uncertainty on the node j output \(o_j\), but only by the inherent uncertainty of the output of node i. Hence:
Δ\(_{i,j}\) = SortOf(i) * Maybe(j) * \(\sum_k w_{j,k}\) * Maybe(k) * UpError(k).

The certainty at layer l−1 is left alone, but at layers l and l+1 it is enhanced by squaring the uncertainty; combining these for all successors k of the i,j link, we make our estimate Δ of the backpropagating error direction. Adjusting all the \(w_{i,j}\)'s in the opposite direction, to reduce the errors, over a sequence of run-throughs of the training data set, is how backpropagation brings increased overall certainty to its mapping from the chaos of the training set to something close to the desired outputs f(\(a_r\)).
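For those who want it spelled out, the same signal in code form; the data structures (o and up_error as per-node tables, w keyed by node pairs, succ[j] listing j's successors) are my own illustrative assumptions:

```python
def sort_of(o, n):
    return o[n]                       # SortOf(n) = o_n

def maybe(o, n):
    return o[n] - o[n] ** 2           # Maybe(n) = SortOf - SortOf∘SortOf

def delta(i, j, o, w, up_error, succ):
    """Update signal Δ_{i,j}, a restatement of (26.1)."""
    return sort_of(o, i) * maybe(o, j) * sum(
        w[(j, k)] * maybe(o, k) * up_error[k] for k in succ[j])
```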
The Dunning Kruger effect is where a person who doesn't know, thinks he knows (more often a "he", indeed). Ignorance combined with not wanting to learn, would be an even worse combination. Backpropagation seems to follow an Anti-Dunning-Kruger approach.
When the error-slopes and errors are small or the certainty levels low, the pre-existing mass of the weights at later layers moves more slowly toward the earlier layers. On the other hand, when the certainty levels are high and the errors and error-slopes are larger, the overall mass of the later weights moves most, and most quickly, toward the earlier layers. Thus learning is accelerated by high certainties in the feedforward analysis combined with a large error-derivative, defining, across the weights taken as a whole, a definite direction to improve towards.
This is like curiosity combined with intellectual humility: if I really thought I had it right, but I was wrong, that's when I most need to learn, and indeed I have the greatest opportunity to learn.
These factors of error magnitude and (forward) analysis certainty operate in a trading relationship, but the general idea is that, for re-estimating a weight \(w_{i,j}\), we leave alone, or incorporate directly, the degree of uncertainty taken from \(o_i\); whereas we enhance the uncertainty (by subtracting its square) from the later outputs \(o_j\) and \(o_k\). Generally, the learning process in this system moves the mass of the later-layer weights \(w_{j,k}\) into the earlier-layer weights \(w_{i,j}\), taking careful account of the uncertainties presently encoded, but moving more where the errors are larger.
I hope this helps you to have an intuitive understanding of the backpropagation concept.
Any comments are welcome and much appreciated.