Given high school algebra and a little trigonometry and the derivative rules, plus some patience, you can learn here the ideas and math behind neural networks, plus improvements and intuitions about them which I haven't seen elsewhere. Please take notes, and enjoy the ride, and the view at the top!
Equality is indicated horizontally by =, vertically by ├─┤ (pronounced "H"), definitionally by ≜, and horizontally between expressions or equations by ⌶.
I especially hope you will use H Notation which considerably condenses and simplifies homework and mathematical (and linguistic) derivations. Reduce the pointless rewriting of long equations, and replace the sloppy scribbling of brackets and braces and circles and arrows. H notation provides a tidy, specific, precise tool to make your meaning clear and exact. It says the above part is equal to, and can be replaced by, this other part below. Please use H Notation, to help yourself as well as your readers and students to easily follow even complex derivations.
More:
Evaluation: \(\big|^c\) indicates evaluation under \(c\), one of multiple conditions.
Elsewhere: a bare \(\big|\), with no condition, indicates the "elsewhere" or "otherwise" condition. For example, \(max(a,b)= \big|^{a>b} a,\ \ \big|\) \(b\). Sweet and simple, and fun to draw.
Last, function composition is normally represented pair-wise, by ∘; let me add ⦾ for list-wise composition.
(1) f ∘ g (x) ≜ g(f(x)).
Both evaluation orders, as-read-first and nearest-first order (inside out), are used and found intuitive by different people. To choose explicitly, let us use reading order, taking inner to outer as being represented on the left and right sides of the ∘ symbol, respectively. Also, let composition take associative priority over function application, so that: f ∘ g (x) = (f ∘ g)(x). I will let ⦾ represent (ordered) function composition of its argument list, like ∑ and ∏ represent (unordered) summation and product.
These list-wise operators are one way to demonstrate that some kind of truth, some actual generalization, has been found, because instead of an enumeration of cases, one must have found the generating expression which captures all the cases. It is important to capture an idea, insight, or observation at the correct level of abstraction (an enumeration is less abstract, a generating expression more so), because otherwise you haven't really understood what you think you know.

Given \(B\) functions \(\alpha_b\), \((0\le b\lt B)\), define a composition C:

(2.1) \( \large C \triangleq \underset{\large b}{\Huge ⦾}\ \alpha_b \)

indicating

(2.2) \( C(x) = \alpha_0 \circ \alpha_1 \circ \alpha_2 \circ \dots \circ \alpha_{B-1}\ (x) = \alpha_{B-1}( \dots \alpha_2( \alpha_1( \alpha_0(x))) \dots ) \)

Using ⦾ we can define an entire neural network in four symbols: ⦾ W ∘ \(\theta\).
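If you'd like to see ⦾ as code, here is a minimal Python sketch (the names compose, add1, and double are my own, for illustration):

```python
from functools import reduce

# List-wise composition in the reading-order convention of (1):
# compose(f, g)(x) == g(f(x)).
def compose(*fs):
    return lambda x: reduce(lambda acc, f: f(acc), fs, x)

add1 = lambda x: x + 1
double = lambda x: 2 * x
assert compose(add1, double)(3) == 8   # double(add1(3))
```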
So I'll often let you figure out the indexes for the argument list. The index range may be inferrable from context, as in \(\sum_e\), or given by set membership rather than a sequence of numbers. For example, \(\large\sum_{e \in S}\) lets you enumerate later (if you ever need to write it as code): just get all the values \(e\) in the set \(S\). Too much obsession about the indexes when you're trying to capture the simple essence of something is distracting, and defeats the weak mind of this author.
Generally, zero-based enumeration, 0...N-1, for the N elements in a list, does help clear thinking, as I learned from the great George Mou's Yale PhD thesis, A Formal Model for Parallel Processing. He uses index-renumbering functions which take a subset of indexes of a larger set, and renumbers them from 0 to N-1, so that within the discussion of the subset you don't have to keep track of where you are in the superset. Then you can refer to the indexes of members of a subset as counting from 0 to N-1. Easy!
Also, I find it quite helpful to use variable names that are evocative and mnemonic: \(r\) for row number, \(a\) to \(z\) represent inputs to outputs, \(l\) for layer, \(w\) for weight, \(E\) and \(e\) for error, \(s\) for sum, \(o\) for (node) output, \(C\) for a composition. Also, I associate in my mind \(i\), \(j\), and \(k\) with nodes in the previous, current, and next layers of the network, respectively, so \(j\) refers to any node in the current level \(l\), and \(\large \sum_i\) refers to all nodes \(i\) in level \(l-1\), etc. It's easier to think about complex matters when the variable names are more meaningful. Still, some variable names merely came to hand when needed: \(\alpha\), \(B\), \(A\), \(F\), \(T\).
Let \(C^n\) be the nth derivative of \(C\), \( \alpha^n_b \) the nth derivative of \( \alpha_b\), etc. Hence \(C^0 = C\).
In some cases where n is 1 or 2, I will also write \(\large e'\) for \(\large \frac{\partial e}{\partial w_{i,j}}\), \(\large e''\) for \(\large \frac{\partial^2 e}{(\partial w_{i,j})^2}\), and A' for \(\large \frac{\partial A}{\partial o_s}\).
(10) \( \large \frac{d}{dx} g(f(x)) = \frac{\partial g}{\partial f} * \frac{df}{dx} \)

Repeatedly applying the chain rule to a repeated composition C (setting \(\alpha_{-1} \triangleq x\)) yields its first derivative \(C^1\):

(11) \( \large C^1(x) \triangleq \frac{d}{dx} C(x) = \frac{d}{dx} \underset{\large b}{\Huge ⦾}\ \alpha_b (x) = \prod_b \frac{\partial \alpha_{b}}{\partial \alpha_{b-1}} (x) \)
Or in fewer symbols:
(3) \( \large C^1 = \big( \underset{\large b}{\Huge ⦾}\ \alpha_b \big)^1 = \prod_b \alpha_b^1 \ \ \ \ \ \ \ (CR) \)
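A quick numeric spot-check of (CR) for B = 2, with invented α's; note that each factor is evaluated at the output of the stage before it, which the compact notation leaves implicit:

```python
import math

# C = α0 then α1 (reading order), with α_{-1} ≜ x.
alpha0, dalpha0 = math.sin, math.cos
alpha1 = lambda u: u ** 2
dalpha1 = lambda u: 2 * u

C = lambda x: alpha1(alpha0(x))
# (11): C'(x) = α1'(α0(x)) * α0'(x)
C1 = lambda x: dalpha1(alpha0(x)) * dalpha0(x)

x, h = 0.7, 1e-6
numeric = (C(x + h) - C(x - h)) / (2 * h)   # finite-difference estimate
assert abs(C1(x) - numeric) < 1e-6
```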
Let's be careful about what we can do with graphs here, and let's consider the cases for a focus node B. Our reasoning should apply to any node, with zero, one or more predecessors, and zero, one or more successors.
The function composition relationship is that the output of the inner function is the input of the outer function. A feeds B. B(A(x)) is the composition of B over A.
Call the predecessor P and the successor S. If P's input is x and S's output is y, then y=S(P(x)). The chain rule asserts further that one may separately differentiate, then multiply, the parts of a composition. Holding constant any other inputs to P:
\(\large \frac{dy}{dx} = \frac{dS}{dx} = \frac{dS}{dP} * \frac{dP}{dx}\)

This lets us build a (so far non-branching) graph of forward links, in which the parts are separately differentiable, and where multiplying the derivatives gives the cumulative derivative across a subsequence of links.
For backward calculation of derivatives we hold all the inputs constant but one, and differentiate with respect to that one, using the normal chain rule method. And if we can do one, we can do them all the same way.
But generally we can get multiple outputs by using multiple nodes, each with a single output. So much for multiple outputs.
But consider multiple output links from a node. These either represent multiple outputs, discussed above, or a single output from a node, taken as the input to multiple successors. In the latter case, forward calculations take the single output of a node and just copy or share it to be fed as input to each of perhaps many successors: simply a no-op. We have handled all the easy cases; the one which remains, the backward, differentiation process, is where we have a problem.
Suppose we are doing backward propagation of the differentiation process, layer by layer across the network from the end toward the beginning. Suppose we encounter a node with multiple successors. We have to take the derivative of a forward-expanding node. Let's say we have node A with its own input x, and feeding at least two successors, say B and C. Locally, B(A(x)) and C(A(x)) are the local compositions of functions. Differentiating backward, we might guess that the derivative of multiple successors is the sum of multiple derivatives. This is consistent with the idea of Total Derivative, in which if \(z = f(x(t),y(t))\), then the total derivative
\(\large \frac{dz}{dt} = \frac{df}{dt} = \frac{df}{dx} \frac{dx}{dt} +\frac{df}{dy} \frac{dy}{dt}\).

In English, the derivative from multiple successor paths is the sum of the chained derivatives along each successor path.
By what intuition should this be the case for neural networks? We seek to find how an error at the end of the forward calculation varies with respect to the different parameters of the model: \(\large \frac{de}{dp}\) for some parameter \(p\). One node's output is \(t\); it is taken as input by multiple nodes such as \(x\) and \(y\); in the end the system output is a function of both those paths, \(f()\), and we want to know what the derivative of the error is with respect to \(t\), when \(t\) feeds into the error by multiple different functions of \(t\).
Another way to think of it is, if \(t\) changes it drives a change in \(e\) via the path \(x\), and drives a further separately calculable change in \(e\) via the path \(y\), and we might approximate the joint effects through both paths \(x\) and \(y\) on \(e\) as the sum of the two effects through the two separate paths. It might be an overestimate, but if it approaches zero effect via each path separately then we might expect that the joint effect would also tend toward zero. Anyway on that intuition and the theory of Total Derivatives, we can proceed.
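Here is a numeric check of that intuition on an invented two-path example, comparing the sum of the chained path derivatives against a finite-difference estimate:

```python
# z = f(x(t), y(t)) with f(x, y) = x * y, x(t) = t**2, y(t) = t + 3, so
# dz/dt = (df/dx)(dx/dt) + (df/dy)(dy/dt) = y * 2t + x * 1.
x = lambda t: t ** 2
y = lambda t: t + 3
z = lambda t: x(t) * y(t)

t, h = 1.5, 1e-6
two_path_sum = y(t) * 2 * t + x(t) * 1.0       # sum over the two paths
numeric = (z(t + h) - z(t - h)) / (2 * h)      # direct estimate of dz/dt
assert abs(two_path_sum - numeric) < 1e-6
```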
All Hail the Chain Rule!
This identity between flow of information across nodes in a graph and function composition is the key idea in neural networks. With the chain rule we can now calculate not just outputs layer by layer, but also derivatives layer by layer, which is what lets us train them, by seeking those zero-slope parameter values where they can't be improved any more. Great. Remember the chain rule.
A neural network involves the elements {\( n, L, W, \theta, e \)}, which the following sections define.
We are given a train/test dataset R (Rows), mapping from a to z, for \(0\le r\lt|R|\)
(4) R[r] ≜ (\(a_r\), \(z_r\)), so that each observation (row) is an observed input \(a_r\), associated with an observed (target) output \(z_r\).
f's output, which we hope will be close to \(z_r\), is calculated as the estimate \(\hat z_r\), as follows:
A Neural Network is a function f: \( \newcommand{\COMPOP}{\Huge ⦾} \) $$\large f = \underset{\large l}{\COMPOP} \large W_l \circ \theta $$
considering each \(W_l\) as its matrix-multiply function. In short, a neural network is a composition of weightings and thresholdings, layer by layer, from the inputs to the final output.
(5) \( \large \hat z_r \triangleq f(a_r) \triangleq \underset{\large l=0}{\overset{\large L}{\Huge ⦾}} W_l \circ \theta\ (a_r) \)
Concise is nice, but clear is best. So let's restate with more concreteness and perhaps clarity.
By definition of ⦾,
(6) \( \hat z_r = W_0 \circ \theta \circ W_1 \circ \theta \circ \dots \circ W_L \circ \theta\ (a_r) = \theta(W_L( \dots \theta(W_1(\theta(W_0(a_r)))) \dots )) \)
(I won't mention matrices again.)
This composition of layers in Neural Networks lets us calculate f by proceeding layer by layer across the network, calculating intermediate variables \(s_j\) (sum) and \(o_j\) (output) for each layer l, as follows:
(7) \(s_j \triangleq \sum_i w_{i,j} * o_i\) ("sum")

and

(8) \(o_j \triangleq \theta(s_j)\) ("output")

Composition of layers also justifies using the chain rule of differentiation to calculate derivatives with respect to the \(w_{i,j}\) ∈ W, also layer by layer, but from the last layer toward the first.
All these are intuitive, but my contribution here is the gas gauge or
"meter symbol", which I find especially helpful in drawing out
dependencies so I can visualize how to apply the chain rule
correctly. A meter symbol correctly suggests a range from empty to
full, 0 to 100%; it helpfully separates the sum s from the
output o, and it even looks a little bit like the θ that
it represents.
If you like this kind of view, then behold, for your mere enjoyment, a completely imaginary 2-hidden-layer binary classifier, where ẑ>0.5 means Yes, otherwise No:
In this invented machine each system-input ar is a vector
of three input values, which are directly made over to be the outputs of
three nodes in layer 0. Each (subsequent) layer
calculates \(s_j\) using equation (7) then \(o_j\) using equation (8),
for each node in that layer.
As the designer you get to pick which nodes have weights between them, and which don't. Fully connected layers, between adjacent layers, is just one design. Whatever, the algorithms don't care. The action is all in the weights that are not zeroes. Any zero weight will stay zero, is all you need to know.
To keep track of the calculations in the giant potential mess this can become, where subscripts and indices and layers are variables that increment across the network, it helps me to have a completely generic instance in mind, with established index names which have meaning in this generic case. Then I can minimize the mental effort of getting the indexes right, which can add so much noise as to limit clear thinking. Generic, but not chaotic, helps me achieve clarity.
To visualize the information flows, I apply these graphical conventions together in the simplest, but fully general, case of a single arbitrary node, j, at an arbitrary interior level, \(0\lt l\lt L\). I am being careful to say nothing that reduces the generality of this picture, so that we can apply it to every node in the network, and therefore we do actually understand the entire process comprehensively, at least from the bottom up, and from one end to the other and back again.
So: Consider an arbitrary node, j, in an arbitrary level, l, along with an arbitrary predecessor of j, numbered i, in level l-1, and an arbitrary successor of j, numbered k, in level l+1. In my notation here, k's live on level l+1, so for example \(\sum_k\) means a sum over all the *successors* of the given j. The forest of weights is hidden except for single weights in and out of node j, by way of example.
The last layer's output value \(o_z\) is the function output f(ā) = ẑ.
There may be one or more nodes in the final layer, producing a vector or list for ẑ, which makes no difference except that your target outputs must also be organized the same way, with the same number of elements in z as the final layer has nodes, and that you will end up with different weights for the different end nodes. You will calculate -- using the same methods -- up to multiple endpoints. If you can do it for one, you can do it for them all.

So the essence is simple: Multiply, sum, threshold. Applying this across a whole network is the entire concept.
So now also in the visual domain, you have the idea. That's it!
What's left is to put this idea that you already know into explicit instructions that you could write code from, and then to understand how to train these things.
(9) f(\(a_r\)) = ẑ ≜ \(o_j\), j ∈ \(L_L\), emerges after iterating over layers from \(L_0\) to \(L_L\).
At each layer, we calculate intermediate results \(s_j\) and \(o_j\) for each j ∈ Ll.
The sequence is:
Base case: For j ∈ \(L_0\), set \(o_j = a_r[j]\) (the inputs).

Recursion: For l ≔ 1 to L:

For j ∈ \(L_l\):

\(s_j = \sum_{i \in L_{l-1}} o_i w_{i,j}\) and

\(o_j = \theta(s_j)\)

At the end, \(f(a_r) = \hat z_r = o_j\) where \(j \in L_{L}\).
Retain with each node the local value of \( o_j\), since it will also be used later in training.
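To make the recursion concrete, here is a minimal NumPy sketch of the forward pass; the function names, layer sizes, and random weights are invented for illustration, with \(w_{i,j}\) stored as weights[l][i, j]:

```python
import numpy as np

def theta(s):
    """Sigmoid threshold, per (22)."""
    return 1.0 / (1.0 + np.exp(-s))

def forward(a_r, weights):
    """Forward pass per (7) and (8). Returns the o's of every layer,
    retained because they are used later in training."""
    outputs = [np.asarray(a_r, dtype=float)]   # base case: o_j = a_r[j]
    for W in weights:                          # recursion over layers
        s = W.T @ outputs[-1]                  # (7): s_j = sum_i w_{i,j} o_i
        outputs.append(theta(s))               # (8): o_j = theta(s_j)
    return outputs

# An invented 3-input network with two hidden layers and one output:
rng = np.random.default_rng(0)
ws = [rng.standard_normal((3, 4)),
      rng.standard_normal((4, 4)),
      rng.standard_normal((4, 1))]
z_hat = forward([0.2, 0.5, 0.9], ws)[-1]       # (9): the final o is ẑ
```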
Standard in NN training is to use (11) (with b = l and \(\alpha_b = W_l \circ \theta\)). (11) is the reason neural networks work. But it's slow: we adjust the weights using a fixed multiple of the direction \(\large e'\), instead of using Newton-Raphson, which needs the second derivative \(C^2\), and which would let us at least do a linear approximation for an actual reasoned next estimate of a minimum-error setting of the weights. That ought to be many times faster -- depending on your step size -- as well as more precise.
A discussion of why Newton-Raphson should normally be much better than gradient descent, along both a geometrical and a mathematical derivation of the Newton Update, is here.
To be better, we must improve it, and the improveable part of a neural network is its weights. To improve the weights to get a better, trained, neural network model, we must adjust them from their original, possibly random or uniform but anyway far-from-ideal, untrained values.
How do we adjust them? In the direction of reducing errors. So, direction is the key concept. And we figure out what direction by using derivatives.
So we seek to minimize total error, E:
(13) E ≜ \(\sum_r e_r\).

With reference to any training row R[r] = (\(a_r\), \(z_r\)) (leaving subscripts r unwritten), let:
(14) e ≜ \( \frac{1}{2}\)(ẑ − z)\(^2\), whence:
(15) \( \large \frac{\partial e}{\partial \hat z} \) \( \large = \frac{\partial }{\partial \hat z} \frac{1}{2} (\hat z - z)^{2} \) \( \large = \frac{2}{2}(\hat z - z)^1 \frac{\partial }{\partial \hat z} (\hat z-z) \) \( \large = (\hat z - z)*1 \) \( \large = \hat z - z \)
(15.2) \( \large \frac{\partial^2 e}{(\partial \hat z)^2} \) \( \large = \frac{\partial }{\partial \hat z}(\hat z - z) \) \( \large = 1 \)

We can add these up across some or all the training examples, r, because the derivative of a sum is the sum of the derivatives: dE = d(∑e) = ∑de.

Now (15) is the error derivative with respect to the output ẑ. That's the beginning of the story (starting from the end), but now we want to find the slope or derivative \( \large e' \) of the error with respect to each weight, and adjust the weight by a negative multiple of that derivative (negative in order to move downslope, to reduce the error).
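For concreteness, a tiny worked instance of (14), (15), and (15.2), with invented values:

```python
# Invented values: target z = 1.0, current estimate z_hat = 0.8.
z, z_hat = 1.0, 0.8
e = 0.5 * (z_hat - z) ** 2   # (14): e = 0.02
de = z_hat - z               # (15): -0.2 (negative: raising z_hat lowers e)
d2e = 1.0                    # (15.2): constant curvature
```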
Gradient Descent:

(16) \(e'\) ≜ \( \large \frac{\partial e}{\partial w_{i,j}} \)

(17) \( \large \hat w_{i,j} = w_{i,j}-\delta e' \) (Gradient descent update rule)

This turns out equal to the Newton-Raphson update rule if we specify \(\large \delta = \frac{1}{e''}\). So we will seek not just \(\large e'\), for gradient descent in the right direction, but also \(\large e''\) to estimate the right distance to go.
Newton-Raphson:

(18) \(\large e''\) ≜ \( \large \frac{\partial ^2e}{(\partial w_{i,j})^2} \)

(19) \( \large \hat w_{i,j} = w_{i,j}-\frac{e'}{e''} \) (Newton-Raphson update rule)

Our task here will be to derive these derivatives. Haha!
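A toy one-dimensional sketch, on an invented quadratic error, of why the Newton-Raphson step ought to be faster:

```python
# Minimize e(w) = (w - 2)**2, whose minimum is at w = 2.
# Here e' = 2(w - 2) and e'' = 2.
w_gd, w_nr = 5.0, 5.0
delta = 0.1                              # fixed step for gradient descent
for _ in range(10):
    w_gd -= delta * 2 * (w_gd - 2)       # (17): shrinks the gap by 20% per step
    w_nr -= (2 * (w_nr - 2)) / 2.0       # (19): lands exactly on w = 2 in one step
print(w_gd, w_nr)                        # GD still approaching 2; Newton at 2
```

On a quadratic the Newton step is exact; on a general error surface it is a reasoned estimate rather than a fixed-size nudge.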
There will be a ton of them, \(e'_{i,j}, e''_{i,j}\), one for each weight \(w_{i,j}\), but we will calculate them all, and could update all the weights, in a single backward pass.
\(\large A_j, B_j, C_j\) are derived as follows (taking easiest first):
(20) \( \large e'_{l,i,j} = \frac{\partial e}{\partial w_{i,j}} = \underbrace{\frac{\partial e}{\partial o_j}}_{\large A_j} * \underbrace{\frac{\partial o_j}{\partial s_j}}_{\large B_j} * \underbrace{\frac{\partial s_j}{\partial w_{i,j}}}_{\large C_j} \)

where, as derived below, \(C_j = o_i\), \(B_j = o_j(1-o_j)\), and \(A_j\) requires a recursion.
\( \begin{align*} \large C_j = \frac{\partial s_j}{\partial w_{i,j}} = \frac{\partial}{\partial w_{i,j}} \sum_{i'} w_{i',j}o_{i'} = \sum_{i'} o_i|_{i'=i}, 0|_{i'\ne i} = o_i \ \ \ \ \ \ (21) \end{align*}\)
The threshold function θ is the logistic sigmoid,

\(\large θ(s)=\frac{1}{1+e^{-s}} \ \ \ \ \ \ (22) \)

and it has a convenient derivative
\(\large \frac{\partial θ(s)}{\partial s} = θ(s)(1-θ(s)) \ \ \ \ \ \ \ (23) \)giving us for \(\large B_j\):
\( \begin{align*} \large B_j = \frac{\partial o_j}{\partial s_j} = \frac{\partial θ(s_j)}{\partial s_j} = θ(s_j)(1-θ(s_j)) = o_j(1-o_j) \ \ \ \ \ (24) \\ \end{align*} \)

Our formulas will make much use of \( B_j = \frac{\partial o_j}{\partial s_j} = o_j(1-o_j) \).
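A quick numeric spot-check of (23), at an arbitrarily chosen point:

```python
import math

theta = lambda s: 1.0 / (1.0 + math.exp(-s))
s, h = 0.3, 1e-5
numeric = (theta(s + h) - theta(s - h)) / (2 * h)     # finite difference
assert abs(theta(s) * (1 - theta(s)) - numeric) < 1e-8  # (23)
```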
Now \(\large A_j\) is the key to backpropagation; it is defined first at the end for final node(s), then at each preceding level it is built up using its previously calculated values from the next later level. The base case of this recursion is with j final:
\( \large A_j = \frac{\partial e}{\partial o_j} = \frac{\partial e}{\partial \hat z} = \hat z - z \ \ \ \ \ (cf\ 15) \)

Then for all the non-final cases, the recursion must add up the effects from each successor to j, that is, for each k ∈ \(L_{l+1}\) a successor of j (we always assume j ∈ \(L_l\)), as follows:

(25) \(\large A_j = \frac{\partial e}{\partial o_j} = \sum_{k} \underbrace{\frac{\partial e}{\partial o_k}}_{\large A_k} * \underbrace{\frac{\partial o_k}{\partial s_k}}_{\large B_k} * \underbrace{\frac{\partial s_k}{\partial o_j}}_{\large F} \) (by the chain rule)

Where:

\( \large A_k\): was already calculated in layer l+1.

\( \large B_k = o_k(1-o_k)\): just as for \(\large B_j\).

\( \large F = \frac{\partial s_k}{\partial o_j} = \frac{\partial}{\partial o_j} \sum_{j'} o_{j'} w_{j',k} = \sum_{j'} w_{j,k}\big|^{j'=j},\ 0\big|^{j'\ne j} = w_{j,k} \)

Hence
\(\begin{align*} \large A_j = \frac{\partial e}{\partial o_j} &= \sum_{k} F*B_k*A_k = \sum_{k} w_{j,k}\ o_k (1-o_k) \frac{\partial e}{\partial o_k} \\ &= \sum_{k} w_{j,k}\ o_k (1-o_k) A_k \ \ \ \ \ \ \ (26.0) \\ \end{align*}\)

Substituting back into (20):
\(\begin{align*} \large e'_{i,j} &= \frac{\partial e}{\partial w_{i,j}} = \large C_j B_j A_j = C_j B_j \frac{\partial e}{\partial o_j} = C_j B_j \sum_{k} (F B_k A_k) \\ \large e' &= \large o_i o_j (1-o_j) \sum_{k}(w_{j,k} o_k (1-o_k) A_k) \ \ \ \ \ \ \ \ (26.1) \\ \end{align*} \)
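Putting (26.0) and (26.1) into code: a minimal NumPy sketch continuing the forward() example above (again my own names and structure; one training row, squared error per (14)):

```python
import numpy as np

def backprop(outputs, weights, z):
    """Per-weight gradients e'_{i,j} via the A, B, C factoring of (20),
    using the recursion (26.0) and the product (26.1)."""
    grads = []
    A = outputs[-1] - z                     # base case (cf 15): A = ẑ - z
    for l in range(len(weights) - 1, -1, -1):
        o_j, o_i = outputs[l + 1], outputs[l]
        B = o_j * (1 - o_j)                 # (24): B_j = o_j (1 - o_j)
        grads.append(np.outer(o_i, B * A))  # (26.1): e'_{i,j} = o_i B_j A_j
        A = weights[l] @ (B * A)            # (26.0): A_j = sum_k w_{j,k} B_k A_k
    return grads[::-1]                      # ordered from first layer to last

# Continuing the invented example: one gradient-descent step per (17).
z_r = np.array([1.0])                       # invented target
grads = backprop(forward([0.2, 0.5, 0.9], ws), ws, z_r)
ws = [W - 0.1 * g for W, g in zip(ws, grads)]
```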
We build the computation graph according to the following types of flows, with each node's intermediate values of s's and o's, etc., each making use of previous calculations.
Feed-forward: a → o → s → ẑ

Back-propagate: → \(A_k \in L_l\) → \(A_j \in L_{l-1}\) → ... → \(A_{L_0}\), and → \(\large e'_{l,i,j} \) → \(w_{i,j}\)
Parallel computation can be achieved by dividing the training data according to the number of available processors and separately calculating \(\large \Delta_{i,j}\)'s or \(\frac{\partial e_r}{\partial w_{i,j}}\) for each \(e_r\), and then combining the results when completed by summing the Δ's across training data subsets.
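A sketch of that data-parallel scheme, building on the forward and backprop sketches above; the rows and the shard count are invented:

```python
import numpy as np

rows = [([0.2, 0.5, 0.9], np.array([1.0])),    # invented (a_r, z_r) pairs
        ([0.1, 0.4, 0.8], np.array([0.0]))]

def grad_sum(rows, weights):
    """Sum of per-row gradients over a subset of the training data."""
    total = [np.zeros_like(W) for W in weights]
    for a_r, z_r in rows:
        for t, g in zip(total, backprop(forward(a_r, weights), weights, z_r)):
            t += g
    return total

n_procs = 2
shards = [rows[p::n_procs] for p in range(n_procs)]   # split the data
partials = [grad_sum(shard, ws) for shard in shards]  # each could run on its own processor
combined = [sum(gs) for gs in zip(*partials)]         # Δ's sum across subsets
```

Summing the partial results afterward gives the same answer as one sequential pass, because the derivative of a sum is the sum of the derivatives.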
Alternatively, processors can be assigned to subsets of the network, to make feedforward or backward calculations, as the data becomes available to their respective subset of the network. The results must be made available to the successor processors, keeping results separate for each observation. Then inputs can be processed both forward and backward not so much simultaneously as overlappingly, in pipelined, first-in-first-out fashion.
(Formal analysis of the algorithm will benefit from the concepts in George Mou's 1991 Yale CS thesis on parallel computation, for example the divide/combine index operations.)
Proceeding layer-by-layer, this adjustment indicator is propagated from the final layer all the way to the start layer, leaving behind evidence, that is, de/d\(o_j\), at each node j, so as to calculate the new value for \(w_{i,j}\). The update process follows the existing network structure: first as to computation flow; second by leaving the zero weights unchanged. Thus the network shape remains unchanged, the number of nodes at each layer remains unchanged; only the weights adjust.
The updates are scaled:
0 ≤ \(o_j\) ≤ 1, from which it follows that
0 ≤ \(o_j\)2 ≤ \(o_j\), and
0 ≤ \(o_j\)(1−\(o_j\)) = \(o_j\)−\(o_j\)2 ≤ \(o_j\).
It always helps to contemplate a graph of the relevant functions.
Define three degrees of uncertainty to apply to a passed-down error-derivative signal, as follows:
SortOf(j) ≜ \(o_j\)

SortOf ∘ SortOf(j) ≜ \(o_j^2\)

Maybe(j) ≜ \(B_j\) = \(o_j - o_j^2\) = SortOf(j) − SortOf(SortOf(j))

UpError(k) ≜ \(de/do_k\)
de/d\(w_{i,j}\) = de/d\(o_j\) * d\(o_j\)/d\(s_j\) * d\(s_j\)/d\(w_{i,j}\) = A * B * C, where, as in (20):

A = \(\sum_k w_{j,k}\) * Maybe(k) * UpError(k)

B = Maybe(j) = \(o_j(1-o_j)\)

C = SortOf(i) = \(o_i\)
Thus the update signal Δ at a node j at level l is a diluted sum of the backpropagating error signals UpError(k) from successor nodes at level l+1, scaled up by the wj,k weights, but diluted by enhanced uncertainties on the links backward from different nodes k to the node j, and by enhanced uncertainty on the node j output \(o_j\), but only by the inherent uncertainty of the output of node i. Hence:
Δ\(_{i,j}\) = SortOf(i) * Maybe(j) * \(\sum_k w_{j,k}\) * Maybe(k) * UpError(k).

The certainty at layer l−1 is left alone, but at layers l and l+1 it is enhanced by squaring the uncertainty; combining these for all successors k of the i,j link, we make our estimate Δ of the backpropagating error direction. Adjusting all the \(w_{i,j}\)'s in the opposite direction, to reduce the errors, over a sequence of run-throughs of the training data set, is how backpropagation brings increased overall certainty to its mapping from the chaos of the training set to something close to the desired outputs f(\(a_r\)).
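For those who want it spelled out, the same signal in code form; the data structures (o and up_error as per-node tables, w keyed by node pairs, succ[j] listing j's successors) are my own illustrative assumptions:

```python
def sort_of(o, n):
    return o[n]                       # SortOf(n) = o_n

def maybe(o, n):
    return o[n] - o[n] ** 2           # Maybe(n) = SortOf - SortOf∘SortOf

def delta(i, j, o, w, up_error, succ):
    """Update signal Δ_{i,j}, a restatement of (26.1)."""
    return sort_of(o, i) * maybe(o, j) * sum(
        w[(j, k)] * maybe(o, k) * up_error[k] for k in succ[j])
```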
The Dunning Kruger effect is where a person who doesn't know, thinks he knows (more often a "he", indeed). Ignorance combined with not wanting to learn, would be an even worse combination. Backpropagation seems to follow an Anti-Dunning-Kruger approach.
When the error-slopes and errors are small or the certainty levels low, the pre-existing mass of the weights at later layers moves more slowly toward the earlier layers. On the other hand, when the certainty levels are high and the errors and error-slopes are larger, the overall mass of the later weights moves most, and most quickly, toward the earlier layers. Thus learning is accelerated by high certainties in the feedforward analysis combined with a large error-derivative, defining, across the weights taken as a whole, a definite direction to improve towards.
This is like curiosity combined with intellectual humility: if I really thought I had it right, but I was wrong, that's when I most need to learn, and indeed I have the greatest opportunity to learn.
These factors of error magnitude and (forward) analysis certainty operate in a trading relationship, but the general idea is that, for re-estimating a weight \(w_{i,j}\), we leave alone, or incorporate directly, the degree of uncertainty taken from \(o_i\); whereas we enhance the uncertainty (by subtracting its square) from the later outputs \(o_j\) and \(o_k\). Generally, the learning process in this system moves the mass of the later-layer weights \(w_{j,k}\) into the earlier-layer weights \(w_{i,j}\), taking careful account of the uncertainties presently encoded, but moving more where the errors are larger.
I hope this helps you to have an intuitive understanding of the backpropagation concept.
Any comments are welcome and much appreciated.