An Accessible Introduction to Neural Networks

by Tom Veatch



Given high school algebra and a little trigonometry and derivative rules, plus some patience, you can here learn the ideas and math behind neural networks, plus improvements and intuitions about them which I haven't seen elsewhere. Please take notes and enjoy the ride, and the view at the top!

Notation

Consider adding these bits of Veatch notation to your armamentarium.

Equality is indicated horizontally by =, vertically by ├─┤ (pronounced "H"), definitionally by ≜, and horizontally between expressions or equations by ⟺.

I especially hope you will use H Notation which considerably condenses and simplifies homework and mathematical (and linguistic) derivations. Reduce the pointless rewriting of long equations, and replace the sloppy scribbling of brackets and braces and circles and arrows. H notation provides a tidy, specific, precise tool to make your meaning clear and exact. It says the above part is equal to, and can be replaced by, this other part below. Please use H Notation, to help yourself as well as your readers and students to easily follow even complex derivations.

More:

Evaluation: \(\big|^c\) indicates evaluation under \(c\), one of multiple conditions.

Elsewhere: a bare \(\big|\) (no condition) indicates the "elsewhere" or "otherwise" condition. For example, \(max(a,b)= \big|^{a>b} a,\ \ \big|\ b\). Sweet and simple, and fun to draw.

Last, function composition is normally represented by ∘, pair-wise, (but let me add) by ⦾, list-wise.

(1)    f ∘ g (x) ≜ g(f(x)).

Both evaluation orders, as-read-first and nearest-first order (inside out), are used and found intuitive by different people. To choose explicitly, let us use reading order, taking inner to outer as being represented on the left and right sides of the ∘ symbol, respectively. Also, let composition take associative priority over function application, so that: f ∘ g (x) = (f ∘ g)(x).

I will let ⦾ represent (ordered) function composition of its argument list, like ∑ and ∏ represent (unordered) summation and product.

These list-wise operators are one way to demonstrate that some kind of a truth, some actual generalization, has been found, because instead of an enumeration of cases, one must have found the generating expression which captures all the cases.

It is important to capture an idea, insight, or observation, at the correct level of abstraction,
because otherwise you haven't really understood what you think you know.

An enumeration is less, a generating expression more, abstract.

Given \(B\) functions \(\alpha_b\), \((0\le b\lt B)\), define a composition C

(2.1)   \( C \triangleq \underset{b}{⦾}\ \alpha_b \)

indicating
(2.2)   \( C(x) = \alpha_0 \circ \alpha_1 \circ \alpha_2 \circ \cdots \circ \alpha_{B-1}\,(x) = \alpha_{B-1}(\,\cdots\,\alpha_2(\alpha_1(\alpha_0(x)))\,\cdots) \)
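
To make the reading-order convention concrete, here is a minimal Python sketch; the names compose and compose_all are mine, standing in for ∘ and ⦾:

    from functools import reduce

    def compose(f, g):
        # Reading-order composition (1): (f o g)(x) = g(f(x)); f is applied first.
        return lambda x: g(f(x))

    def compose_all(alphas):
        # List-wise composition (2.1): apply alphas[0] first, alphas[B-1] last.
        return reduce(compose, alphas)

    # (2.2): C(x) = a2(a1(a0(x)))
    alphas = [lambda x: x + 1, lambda x: 2 * x, lambda x: x ** 2]
    C = compose_all(alphas)
    assert C(3) == ((3 + 1) * 2) ** 2  # 64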

Using ⦾ we can define an entire neural network in four symbols:  ⦾ W ∘ \(\theta\).

Hints and Indexes

Evocative or meaningful variable names help to think clearly in a complex space. And indexes for a list operator can often be hinted at or expressed as a set rather than a sequence of numbers, to keep the math simple.

So I'll often let you figure out the indexes for the argument list. The index range may be inferrable from context, as in \(\sum_e\), or by set membership rather than a sequence of numbers. For example, \(\large\sum_{e \in S}\) lets you enumerate later (if you ever need to write it as code): just get all the values \(e\) in the set \(S\). Too much obsession about the indexes when you're trying to capture the simple essence of something is distracting and defeats the weak mind of this author.
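
For instance, in Python that enumeration really is just iteration over the set (a trivial sketch):

    S = {0.5, 1.25, 2.0}
    total = sum(e for e in S)   # the sum over all e in S, no index bookkeeping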

Generally, zero-based enumeration, 0...N-1, for the N elements in a list, does help clear thinking, as I learned from the great George Mou's Yale PhD thesis, A Formal Model for Parallel Processing. He uses index-renumbering functions which take a subset of indexes of a larger set, and renumbers them from 0 to N-1, so that within the discussion of the subset you don't have to keep track of where you are in the superset. Then you can refer to the indexes of members of a subset as counting from 0 to N-1. Easy!

Also, I find it quite helpful to use variable names that are evocative and mnemonic: \(r\) for row number, \(a\) to \(z\) represent inputs to outputs, \(l\) for layer, \(w\) for weight, \(E\) and \(e\) for error, \(s\) for sum, \(o\) for (node) output, \(C\) for a composition. Also, I associate in my mind, \(i\), \(j\), and \(k\), with nodes in the previous, current, and next layers of the network, respectively, so \(j\) refers to any node in the current level \(l\), and \(\large \sum_i\) refers to all nodes \(i\) in level \(l-1\), etc. It's easier to think about complex matters when the variable names are more meaningful. Still, some variable names merely came to hand when needed, \(\alpha\), \(B\), \(A\), \(F\), \(T\).

Let \(C^n\) be the \(n\)th derivative of \(C\), \( \alpha^n_b \) the \(n\)th derivative of \( \alpha_b \), etc. Hence \(C^0 = C\).

In some cases where n is 1 or 2, I will also write \(\large e'\) for \(\large \frac{\partial e}{\partial w_{i,j}}\), \(\large e''\) for \(\large \frac{\partial^2 e}{(\partial w_{i,j})^2}\), and A' for \(\large \frac{\partial A}{\partial o_s}\).

The Chain Rule is Everything

The Chain Rule (CR) says the derivative of a composition is the product of the derivatives.
(10)      \( \large \frac{d}{dx} g(f(x)) = \frac{\partial g}{\partial f} * \frac{df}{dx} \)
Repeatedly applying the chain rule to a repeated composition C (setting \(\alpha_{-1} \triangleq x\)) yields its first derivative \(C^1\):
(11)    \( \large C^1(x) \triangleq \frac{d}{dx} C(x) = \frac{d}{dx}\, \underset{b}{⦾}\, \alpha_b(x) = \prod_b \frac{\partial \alpha_b}{\partial \alpha_{b-1}}\,(x) \)

Or in fewer symbols:

(3)   \( \large C^1 = \big( \underset{b}{⦾}\, \alpha_b \big)^1 = \prod_b \alpha_b^1 \ \ \ \ \ \ \ (CR) \)
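
As a sanity check on (3), not from the text but easy to run, the sketch below compares a finite-difference slope of a three-function composition against the chain-rule product, each factor evaluated at its own intermediate input:

    import math

    alphas   = [math.sin, lambda u: u * u, math.exp]   # C(x) = exp(sin(x)^2)
    d_alphas = [math.cos, lambda u: 2 * u, math.exp]   # their derivatives

    def C(x):
        for a in alphas:
            x = a(x)
        return x

    def C1(x):
        # Chain rule (3): product of derivatives, each at its own input.
        prod = 1.0
        for a, da in zip(alphas, d_alphas):
            prod *= da(x)
            x = a(x)
        return prod

    x, h = 0.7, 1e-6
    numeric = (C(x + h) - C(x - h)) / (2 * h)
    assert abs(C1(x) - numeric) < 1e-5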

Graphical Chain Rule

The chain rule is why we can build a computation flow graph. Each node is just a function that takes its inputs to its outputs, but those inputs are themselves the results of functions at previous layers. Stepping along the graph, at each node you compose a function on top of the outputs of the preceding nodes. Each link in the graph represents an additional layer of composition, and each node extends the compositions built up at the preceding nodes with its own function combining those nodes' outputs. Therefore each node is not just a locally calculated function of arbitrary inputs, everything thrashing uncoordinatedly on its own. No: it is an organized, global, or rather feedforward, summary or transformation of the whole process up to that point, because each one is a composition of all its predecessors.

Let's be careful about what we can do with graphs here, and let's consider the cases for a focus node B. Our reasoning should apply to any node, with zero, one or more predecessors, and zero, one or more successors.

The function composition relationship is that the output of the inner function is the input of the outer function. A feeds B. B(A(x)) is the composition of B over A.

Hence we have every kind of branching, forward and backward, at any given node, and we know how to calculate the outputs forward, using the associated functions, and we know how to calculate the derivatives backward. In short, we can build a compositional graph, calculate its outputs forward, and calculate its derivatives with respect to any interior function or parameter, backward.

All Hail the Chain Rule!

This identity between flow of information across nodes in a graph and function composition is the key idea in neural networks. With the chain rule we can go beyond just calculating outputs layer by layer: we can also calculate derivatives layer by layer, which is what lets us train these networks, by seeking those zero-slope parameter values where they can't be improved any more.

Great. Remember the chain rule.

Neural networks, \(a\) to \(z\)

Define a Neural Network to be a function f, relating inputs ā to estimated output(s) ẑ, and parameterized as:

{\( n, L, W, \theta, e \)}

where \(n\) gives the number of nodes (per layer), \(L\) the number of layers, \(W\) the weights \(w_{i,j}\), \(\theta\) the threshold function, and \(e\) the error function.

We are given a train/test dataset R (Rows), mapping from a to z, for \(0\le r\lt|R|\)

(4)    R[r] ≜ (ar, zr)

so that each observation (row) is an observed input ar, associated with an observed (target) output zr.

f's output, which we hope will be close to \(z_r\), is calculated as the estimate \(\hat z_r\), as follows:

A Neural Network is
a function f:
\( \newcommand{\COMPOP}{\Huge ⦾} \) $$\large f = \underset{\large l}{\COMPOP} \large W_l \circ \theta $$

(5)    \( \hat z_r \triangleq f(a_r) \triangleq \underset{l=0}{\overset{L}{⦾}}\ W_l \circ \theta\ (a_r) \)

considering each \(W_l\) as its matrix-multiply function. In short, a neural network is a composition of weightings and thresholdings, layer by layer, from the inputs to the final output.


Concise is nice, but clear is best. So let's restate with more concreteness and perhaps clarity.

By definition of ⦾,

(6)   \( \hat z_r = W_0 \circ \theta \circ W_1 \circ \theta \circ \cdots \circ W_L \circ \theta\ (a_r) = \theta(W_L(\,\cdots\,\theta(W_1(\theta(W_0(a_r))))\,\cdots)) \)

(I won't mention matrices again.)

This composition of layers in Neural Networks lets us calculate f by proceeding layer by layer across the network, calculating intermediate variables \(s_j\) (sum) and \(o_j\) (output) for each layer l, as follows:

SUM:
\(s_j ≜\sum_i w_{i,j} * o_i\)

OUTPUT:
\(o_j≜\theta(s_j)\)   

(7)   \( s_j \triangleq \sum_i w_{i,j} * o_i \)   ("sum")

and

(8)   \( o_j \triangleq \theta(s_j) \)   ("output")

Composition of layers also justifies using the chain rule of differentiation to calculate derivatives with respect to the \(w_{i,j}\) ∈ W, also layer by layer, but from the last layer toward the first.

Graphically

A drawing can indicate calculations just as well as an equation, given meaningful graphical conventions.

[Figure: graphical conventions for neural network calculations. An arc labeled \(w\) means MULTIPLY by a weight or importance value \(w\) (arc means *). An arrowhead means SUM all inputs into \(s\) (arrowhead means +). The "meter" symbol means THRESHOLD, the non-linearity mapping to [0,1]: \(\theta(s)=o\). Squares mark the system INPUTs \(a\) feeding layer 0.]
All these are intuitive, but my contribution here is the gas gauge or "meter symbol", which I find especially helpful in drawing out dependencies so I can visualize how to apply the chain rule correctly. A meter symbol correctly suggests a range from empty to full, 0 to 100%; it helpfully separates the sum s from the output o, and it even looks a little bit like the θ that it represents.

If you like this kind of view, then behold, for your mere enjoyment, a completely imaginary 2-hidden-layer binary classifier, where ẑ>0.5 means Yes, otherwise No:

[Figure: "Detectron", a 2-hidden-layer neural network. Inputs \(a[0], a[1], a[2]\) feed the nodes of layer L0; weights such as \(w_{0,4}\) and \(w_{4,7}\) connect each layer (L0 through L3) to the next; each node shows its meter \(s_j \to o_j\); the last node's output is ẑ.]
In this invented machine each system-input \(a_r\) is a vector of three input values, which are directly made over to be the outputs of three nodes in layer 0. Each (subsequent) layer calculates \(s_j\) using equation (7) then \(o_j\) using equation (8), for each node in that layer.

As the designer you get to pick which nodes have weights between them, and which don't. Full connectivity between adjacent layers is just one design. Whatever; the algorithms don't care. The action is all in the weights that are not zeroes. Any zero weight will stay zero, is all you need to know.

To keep track of the calculations in the giant potential mess this can become, where subscripts and indices and layers are variables that increment across the network, it helps me to have a completely generic instance in mind, with established index names which have meaning in this generic case. Then I can minimize the mental effort of getting the indexes right, which can add so much noise as to limit clear thinking. Generic, but not chaotic, helps me achieve clarity.

To visualize the information flows, I apply these graphical conventions together in the simplest, but fully general, case of a single arbitrary node, j, at an arbitrary interior level, \(0\lt l\lt L\). I am being careful to say nothing that reduces the generality of this picture, so that we can apply it to every node in the network, and therefore we do actually understand the entire process comprehensively, at least from the bottom up, and from one end to the other and back again.

So: Consider an arbitrary node, j, in an arbitrary level, l, along with an arbitrary predecessor of j, numbered i, in level l-1, and an arbitrary successor of j, numbered k, in level l+1. In my notation here, k's live on level l+1, so for example \(\sum_k\) means a sum over all the *successors* of the given j. The forest of weights is hidden except for single weights in and out of node j, by way of example.

[Figure: a fully generic interior node \(j\) at layer \(l\), with an arbitrary predecessor \(i\) at layer \(l-1\) (weight \(w_{i,j}\)) and an arbitrary successor \(k\) at layer \(l+1\) (weight \(w_{j,k}\)); each node shows its meter \(s \to o\), from the input \(a\) at layer 0 through to the final meter \(s_z \to o_z\) at layer \(L\).]

The last layer's output value \(o_z\) is the function output f(ā) = ẑ.

There may be one or more nodes in the final layer, producing a vector or list for ẑ; which makes no difference except that your target outputs must also be organized the same way, with the same number of elements in z as the final layer has nodes, and that you will end up with different weights for the different end nodes. You will calculate -- using the same methods -- up to multiple endpoints. If you can do it for one, you can do it for them all.

So the essence is simple: Multiply, sum, threshold. Applying this across a whole network is the entire concept.

So now also in the visual domain, you have the idea. That's it!

What's left is to put this idea that you already know into explicit instructions that you could write code from, and then to understand how to train these things.

Feed Forward

At risk of repetition, let's break down the forward calculation. The network output,
(9)    \( f(a_r) = \hat z \triangleq o_j,\ j \in L_L \)
emerges after iterating over layers from \(L_0\) to \(L_L\), until finally \(j \in L_L\).

At each layer, we calculate intermediate results \(s_j\) and \(o_j\) for each \(j \in L_l\).

The sequence is:

Base case:
    For \(j \in L_0\) set \(o_j = a_r[j]\)     (inputs)

Recursion:
    For l ≔ 1 to L:
         For \(j \in L_l\)
             \(s_j = \sum_{i \in L_{l-1}} o_i w_{i,j}\)     and
             \(o_j = \theta(s_j)\)

At the end, \(f(a_r) = \hat z_r = o_j\) where \(j \in L_L\).

Retain with each node the local value of \( o_j\), since it will also be used later in training.
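
Here is a minimal Python sketch of that recursion. The data layout is my own choice, not prescribed above: layers as lists of node ids, and weights as a dict keyed by (i, j), where absent keys are the zero weights:

    import math

    def theta(s):
        # Logistic threshold (22): maps any sum into (0, 1).
        return 1.0 / (1.0 + math.exp(-s))

    def feed_forward(a_r, layers, w):
        # Base case: layer 0 nodes take on the input values directly.
        o = {j: a_r[idx] for idx, j in enumerate(layers[0])}
        # Recursion: equations (7) and (8), layer by layer.
        for l in range(1, len(layers)):
            for j in layers[l]:
                s_j = sum(o[i] * w[(i, j)] for i in layers[l - 1] if (i, j) in w)
                o[j] = theta(s_j)
        return o   # keep every o_j; the final layer's entries are z-hat

    # Usage on an invented shape: 3 inputs, one hidden layer, one output node.
    layers = [[0, 1, 2], [3, 4], [5]]
    w = {(0, 3): 0.1, (1, 3): -0.2, (2, 4): 0.3, (3, 5): 0.5, (4, 5): -0.4}
    o = feed_forward([1.0, 0.0, 1.0], layers, w)
    z_hat = o[5]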

Standard in NN training is to use (11) (with \(b=l\) and \(\alpha_b = W_l \circ \theta\)). (11) is the reason neural networks work. But it's slow. We adjust the weights using a fixed multiple of the direction \(\large e'\), instead of using Newton-Raphson, which needs the second derivative \(C^2\) and which would let us at least do a linear approximation for an actual reasoned next estimate of a minimum-error setting of the weights. That ought to be many times faster -- depending on your step size -- as well as more precise.

A discussion of why Newton-Raphson should normally be much better than gradient descent, along with both a geometrical and a mathematical derivation of the Newton Update, is here.

Neural Network Training

Starting with some kind of a neural network and some data, we want it to be better -- at producing estimated outputs \(\hat z\) as close as possible to the \(z\)'s in our training data. Hopefully it'll do well also on the test data.

To be better, we must improve it, and the improvable part of a neural network is its weights. To improve the weights to get a better, trained, neural network model, we must adjust them from their original, possibly random or uniform but anyway far-from-ideal, untrained values.

How do we adjust them? In the direction of reducing errors. So, direction is the key concept. And we figure out what direction by using derivatives.

So we seek to minimize total error, E:

(13)    \( E \triangleq \sum_r e_r \).

With reference to any training row R[r] = (ar, zr) (leaving subscripts r unwritten), let:
\( \large e ≜ \frac{1}{2}(ẑ - z)^2 \ \ \ \ \\ \large \frac{\partial e}{\partial \hat z} = \hat z - z \ \ \ \ \\ \large \frac{\partial^2 e}{(\partial \hat z)^2} = 1 \)

(14)    \( e \triangleq \frac{1}{2}(\hat z - z)^2 \)
whence:

(15)     \( \large \frac{\partial e}{\partial \hat z} \) \( \large = \frac{\partial }{\partial \hat z} \frac{1}{2} (\hat z - z)^{2} \)
\( \large = \frac{2}{2}(\hat z - z)^1 \frac{\partial }{\partial \hat z} (\hat z-z) \)
\( \large = (\hat z - z)*1 \)
\( \large = \hat z - z \)
and
(15.2)     \( \large \frac{\partial^2 e}{(\partial \hat z)^2} \) \( \large = \frac{\partial }{\partial \hat z}(\hat z - z) \)
\( \large = 1 \)
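A two-line numeric check of (15), for whatever reassurance it offers:

    z, z_hat, h = 1.0, 0.3, 1e-6
    e = lambda zh: 0.5 * (zh - z) ** 2                  # (14)
    numeric = (e(z_hat + h) - e(z_hat - h)) / (2 * h)   # slope of e at z_hat
    assert abs(numeric - (z_hat - z)) < 1e-9            # (15): e' = z_hat - z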
We can add these up across some or all the training examples, r, because the derivative of a sum is the sum of the derivatives.

dE = d(∑e) = ∑de
Now (15) is the error derivative with respect to the output ẑ. That's the beginning of the story (starting from the end), but now we want to find the slope or derivative \( \large e' \) of the error with respect to each weight, and adjust the weight by a negative multiple of that derivative (negative in order to move downslope, to reduce the error).

Gradient Descent:

\( \large \hat w_{i,j} = w_{i,j} - \delta e'\)

(16)    \(e'\) ≜ \( \large \frac{\partial e}{\partial w_{i,j}} \).
(17)     \( \large \hat w_{i,j} = w_{i,j}-\delta e' \)      (Gradient descent update rule)

which turns out equal to the Newton-Raphson update rule if we specify
\(\large \delta = \frac{1}{e''}\).
So we will seek not just \(\large e'\), for gradient descent in the right direction, but also \(\large e''\) to estimate the right distance to go.
Newton-Raphson:

\( \large \delta = \frac{1}{e''} \)

(18)    \(\large e''\) ≜ \( \large \frac{\partial ^2e}{(\partial w_{i,j})^2} \)
(19)     \( \large \hat w_{i,j} = w_{i,j}-\frac{e'}{e''} \)            (Newton-Raphson update rule)
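
To see the two update rules side by side, here is a one-dimensional sketch on a toy error curve; the curve and the step size δ are illustrative assumptions of mine, not anything from the text:

    def e(w):   return (w - 3.0) ** 4         # toy error, minimized at w = 3
    def e1(w):  return 4 * (w - 3.0) ** 3     # e'
    def e2(w):  return 12 * (w - 3.0) ** 2    # e''

    w_gd = w_nr = 0.0
    delta = 0.01
    for _ in range(100):
        w_gd = w_gd - delta * e1(w_gd)        # (17) gradient descent
        w_nr = w_nr - e1(w_nr) / e2(w_nr)     # (19) Newton-Raphson
    # Newton closes a fixed fraction (here 1/3) of the remaining gap every step;
    # gradient descent creeps at a rate set by delta times the local slope.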

Our task here will be to derive these derivatives. Haha!

There will be a ton of them, \(e'_{i,j}, e''_{i,j}\), one for each weight \(w_{i,j}\), but we will calculate them all, and could update all the weights, in a single backward pass.

\(e'\)

We begin with the error \(e_r\). \(e_r\) is not directly a function of \(w_{i,j}\), so we have some work to do to find its derivative with respect to \(w_{i,j}\). Now \(e_r\) is indeed a decomposable composition of functions: at the final node(s), e is a function of \(o_{final}\) (= ẑ) (by 14); at every node j, \(o_j\) is a function of \(s_j\) (by 8); and \(s_j\) is a function of \(w_{i,j}\) (by 7). That's true for o, s, and w at the final layer given the feedforward values that got us there, and it's also true at every preceding layer, if we propagate the calculations of \( \large A_j = \frac{\partial e}{\partial o_j} \) from the final layer backward from \(L_L\) to \(L_0\). So (using H Notation, mentioned above), for \(i, i' \in L_{l-1}\), \(j \in L_l\), we can use the chain rule:
(20)    \( \large e'_{l,i,j} = \frac{\partial e}{\partial w_{i,j}} = \underbrace{\frac{\partial e}{\partial o_j}}_{A_j\ \text{(see below)}} * \underbrace{\frac{\partial o_j}{\partial s_j}}_{B_j\ =\ o_j(1-o_j)} * \underbrace{\frac{\partial s_j}{\partial w_{i,j}}}_{C_j\ =\ o_i} \)

\(\large A_j, B_j, C_j\) are derived as follows (taking easiest first):

\(\large C_j=o_i\)
\( \begin{align*} \large C_j = \frac{\partial s_j}{\partial w_{i,j}} = \frac{\partial}{\partial w_{i,j}} \sum_{i'} w_{i',j}o_{i'} = \sum_{i'} o_i|_{i'=i}, 0|_{i'\ne i} = o_i \ \ \ \ \ \ (21) \end{align*}\)
\(\large θ(s)=\Large \frac{1}{1+e^{-s}}\)

As mentioned, the logistic function θ typically used to calculate node outputs, \( o_j = θ(s_j) \), is
\(\large θ(s)=\frac{1}{1+e^{-s}} \ \ \ \ \ \ (22) \)
and has a convenient derivative
\(\large \frac{\partial θ(s)}{\partial s} = θ(s)(1-θ(s)) \ \ \ \ \ \ \ (23) \)
giving us for \(\large B_j\):
\(\large B_j=o_j(1-o_j)\)
\( \begin{align*} \large B_j = \frac{\partial o_j}{\partial s_j} = \frac{\partial θ(s_j)}{\partial s_j} = θ(s_j)(1-θ(s_j)) = o_j(1-o_j) \ \ \ \ \ (24) \\ \end{align*} \)
Our formulas will make much use of \( B_j = \frac{\partial o_j}{\partial s_j} = o_j(1-o_j) \).
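
The convenient derivative (23) is easy to confirm numerically (a quick sketch):

    import math

    theta = lambda s: 1.0 / (1.0 + math.exp(-s))
    s, h = 0.9, 1e-6
    numeric = (theta(s + h) - theta(s - h)) / (2 * h)
    assert abs(numeric - theta(s) * (1 - theta(s))) < 1e-9   # (23)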

Now \(\large A_j\) is the key to backpropagation; it is defined first at the end for final node(s), then at each preceding level it is built up using its previously calculated values from the next later level. The base case of this recursion is with j final:

\( \large A_j = \frac{\partial e}{\partial o_j} = \frac{\partial e}{\partial \hat z} = \hat z - z \ \ \ \ \ (cf\ 15) \)
Then for all the non-final cases, the recursion must add up the effects from each successor to j, that is, for each k ∈ Ll+1 a successor of j (we always assume j ∈ Ll), as follows:
(25)    \(\large A_j = \frac{\partial e}{\partial o_j} = \sum_{k} \underbrace{\frac{\partial e}{\partial o_k}}_{A_k} * \underbrace{\frac{\partial o_k}{\partial s_k}}_{B_k} * \underbrace{\frac{\partial s_k}{\partial o_j}}_{F} \)   (by the chain rule)
Where:

\( \large A_k\):      was already calculated in layer l+1.

\( \large B_k = o_k(1-o_k)\)      Just as for \(\large B_j\).

\( \large F = \frac{\partial s_k}{\partial o_j} \) \(\large = \frac{\partial}{\partial o_j} ( \sum_{j'} o_{j'} w_{j',k} ) \)
\(\large = \sum_{j'} w_{j,k}|j'=j, \ \ \ \ 0|j'\ne j. \)
\(\large = w_{j,k} \)
Hence
\(\large A_j=\sum_{k} w_{j,k} B_k A_k\)
\(\begin{align*} \large A_j = \frac{\partial e}{\partial o_j} &= \sum_{k} F*B_k*A_k \ \ \ \ \ \ \ \ \ = \sum_{k} w_{j,k}\ o_k (1-o_k) \frac{\partial e}{\partial o_k} \\ &= \sum_{k} w_{j,k}\ o_k (1-o_k) A_k \ \ \ \ \ \ \ (26.0) \\ \end{align*}\).

Substituting back into (20):
\(\large e' = o_i B_j A_j\)
\(\begin{align*} \large e'_{i,j} &= \frac{\partial e}{\partial w_{i,j}} = \large C_j B_j A_j = C_j B_j \frac{\partial e}{\partial o_j} = C_j B_j \sum_{k} (F B_k A_k) \\ \large e' &= \large o_i o_j (1-o_j) \sum_{k}(w_{j,k} o_k (1-o_k) A_k) \ \ \ \ \ \ \ \ (26.1) \\ \end{align*} \)
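
Assembling (21), (24), and (26.0) into code, here is a back-propagation sketch in the same home-made data layout as the feed-forward sketch above; z is a dict of targets for the final-layer nodes:

    def backprop(o, z, layers, w):
        # Base case (cf 15): A_j = z_hat - z at the final node(s).
        A = {j: o[j] - z[j] for j in layers[-1]}
        e1 = {}
        for l in range(len(layers) - 1, 0, -1):
            for j in layers[l]:
                B_j = o[j] * (1 - o[j])                    # (24)
                for i in layers[l - 1]:
                    if (i, j) in w:
                        e1[(i, j)] = o[i] * B_j * A[j]     # (20): C_j * B_j * A_j
            # (26.0): build A one layer back, from each node's successors k.
            for i in layers[l - 1]:
                A[i] = sum(w[(i, k)] * o[k] * (1 - o[k]) * A[k]
                           for k in layers[l] if (i, k) in w)
        return e1

    # Usage, continuing the feed-forward sketch (target 1.0 at the output node):
    e1 = backprop(o, {5: 1.0}, layers, w)
    for key in e1:
        w[key] -= 0.1 * e1[key]   # gradient-descent update (17), delta = 0.1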

Recursive and parallel implementation

The recursive definitions of \(A_j\) and \(A'_j\) in terms of their successors \(A_k\) and \(A'_k\) are what justifies the back propagation from the final node(s) to each layer of their predecessors.

We build the computation graph according to the following types of flows, with each node's intermediate values of s's and o's, etc., each making use of previous calculations.

Feed-forward:      \( a \to o \to s \to \cdots \to \hat z \)
Back-propagate:  \( \hat z \to A_k \in L_l \to A_j \in L_{l-1} \to \cdots \to A \in L_0 \), yielding \(\large e'_{l,i,j} \to w_{i,j}\) at each layer.

Parallel computation can be achieved by dividing the training data according to the number of available processors and separately calculating \(\large \Delta_{i,j}\)'s or \(\frac{\partial e_r}{\partial w_{i,j}}\) for each \(e_r\), and then combining the results when completed by summing the Δ's across training data subsets.
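
A sketch of that data-parallel scheme using Python's multiprocessing, assuming the feed_forward and backprop sketches above are defined at module level (and Unix-style fork, so workers inherit layers and w):

    from multiprocessing import Pool

    def row_gradient(row):
        # One training row: run forward, then backward; return its e' dict.
        a_r, z_r = row
        o = feed_forward(a_r, layers, w)
        return backprop(o, z_r, layers, w)

    def total_gradient(rows, processes=4):
        with Pool(processes) as pool:
            grads = pool.map(row_gradient, rows)   # one Delta dict per row
        total = {}
        for g in grads:                            # combine: sum across subsets
            for key, val in g.items():
                total[key] = total.get(key, 0.0) + val
        return total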

Alternatively, processors can be assigned to subsets of the network, to make feedforward or backward calculations, as the data becomes available to their respective subset of the network. The results must be made available to the successor processors, keeping results separate for each observation. Then inputs can be processed both forward and backward not so much simultaneously as overlappingly, in pipelined, first-in-first-out fashion.

(Formal analysis of the algorithm will benefit from the concepts in George Mou's 1991 Yale CS thesis on parallel computation, for example the divide/combine index operations.)

Discussion

\(A_j = \hat z - z\) is often called the error signal, although it is the derivative of the error \(e=\frac{1}{2}(\hat z - z)^2\). As the derivative, it carries the direction toward the minimum of errors, not just the magnitude of the error.

Proceeding layer by layer, this adjustment indicator is propagated from the final layer all the way to the start layer, leaving behind evidence, that is, \(\partial e/\partial o_j\), at each node j, so as to calculate the new value for \(w_{i,j}\). The update process follows the existing network structure: first as to computation flow; second by leaving the zero weights unchanged. Thus the network shape remains unchanged, the number of nodes at each layer remains unchanged; only the weights adjust.

The updates are scaled, as the following intuitions spell out.

Backpropagation Intuitions

I like to think of the equation for Δ under a certain intuitive interpretation. If the threshold function θ() has its output bounded by [0,1], as is the case with the logistic threshold, \(o=\theta(s)=\large \frac{1}{1+e^{-s}}\), among others, the output of any node j, labelled \(o_j\), has the property

\( 0 \le o_j \le 1 \)
from which it follows that
\( 0 \le o_j^2 \le o_j \), and
\( 0 \le o_j(1-o_j) = o_j - o_j^2 \le o_j \).

It always helps to contemplate a graph of the relevant functions.

[Figure: the logistic \(L(s)=1/(1+\exp(-s))\), its square \(L(s)^2\), and its slope \(L' = L(s) - L(s)^2\), plotted over \(-5 \le s \le 5\).]

Consider some illuminating cases. If \(o_j\) is interpreted as a degree of certainty, then \(o_j(1-o_j) = o_j - o_j^2\) is always ≥ 0, and always less certain than \(o_j\). The less certain \(o_j\) is, the more dramatically uncertain \(o_j - o_j^2\) becomes, even though it is always ≥ 0.

Define three degrees of uncertainty to apply to a passed-down error-derivative signal, as follows:

SortOf(j) ≜ \(o_j\)
SortOf ∘ SortOf(j) ≜ \(o_j^2\)
Maybe(j) ≜ \(B_j = o_j - o_j^2\) = SortOf(j) − SortOf(SortOf(j))
UpError(k) ≜ \( \partial e/\partial o_k \)

\( \large \frac{\partial e}{\partial w_{i,j}} = \underbrace{\frac{\partial e}{\partial o_j}}_{A\ =\ \sum_k w_{j,k}\,\text{Maybe}(k)\,\text{UpError}(k)} * \underbrace{\frac{\partial o_j}{\partial s_j}}_{B\ =\ \text{Maybe}(j)\ =\ o_j(1-o_j)} * \underbrace{\frac{\partial s_j}{\partial w_{i,j}}}_{C\ =\ \text{SortOf}(i)\ =\ o_i} \)

Thus the update signal Δ at a node j at level l is a diluted sum of the backpropagating error signals UpError(k) from successor nodes at level l+1, scaled up by the \(w_{j,k}\) weights, but diluted by enhanced uncertainties on the links backward from different nodes k to the node j, and by enhanced uncertainty on the node j output \(o_j\), but only by the inherent uncertainty of the output of node i. Hence:

\( \Delta_{i,j} = \text{SortOf}(i) * \text{Maybe}(j) * \sum_k w_{j,k} * \text{Maybe}(k) * \text{UpError}(k) \).
The certainty at layer l−1 is left alone, but at layers l and l+1 the uncertainty is enhanced (by subtracting its square), and then, combining these for all successors k of the i,j link, we make our estimate Δ of the backpropagating error direction. Adjusting all the \(w_{i,j}\)'s in the opposite direction, to reduce the errors, over a sequence of run-throughs of the training data set, is how backpropagation brings increased overall certainty to its mapping from the chaos of the training set to something close to the desired outputs \(f(a_r)\).
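
Spelling the intuition names out as code (a sketch; the helpers just restate the definitions above):

    def sort_of(o_val):            # SortOf(j) = o_j
        return o_val

    def maybe(o_val):              # Maybe(j) = o_j - o_j^2 = B_j
        return o_val - o_val ** 2

    def delta_ij(o, A, w, i, j, successors):
        # Delta_{i,j} = SortOf(i) * Maybe(j) * sum_k w_{j,k} * Maybe(k) * UpError(k)
        up = sum(w[(j, k)] * maybe(o[k]) * A[k] for k in successors)
        return sort_of(o[i]) * maybe(o[j]) * up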

The Dunning Kruger effect is where a person who doesn't know, thinks he knows (more often a "he", indeed). Ignorance combined with not wanting to learn, would be an even worse combination. Backpropagation seems to follow an Anti-Dunning-Kruger approach.

When the error-slopes and errors are small or the certainty levels low, then the pre-existing mass of the weights at later layers moves more slowly toward the earlier layers. On the other hand, when the certainty levels are high and the errors and error-slopes are larger, then the overall mass of the later weights moves most, and most quickly, toward the earlier layers. Thus learning is accelerated by high certainties in the feedforward analysis combined with a large error-derivative, defining, across the weights taken as a whole, a definite direction to improve towards.

This is like curiosity combined with intellectual humility: if I really thought I had it right, but I was wrong, that's when I most need to learn, and indeed I have the greatest opportunity to learn.

These factors of error magnitude and (forward) analysis certainty operate in a trading relationship, but the general idea is that for re-estimating a weight \(w_{i,j}\) we leave alone, or incorporate directly, the degree of uncertainty taken from \(o_i\); whereas we enhance the uncertainty (by subtracting its square) from the later outputs \(o_j\) and \(o_k\). Generally, the learning process in this system moves the mass of the later-layer weights \(w_{j,k}\) into the earlier-layer weights \(w_{i,j}\), taking careful account of the uncertainties presently encoded, but moving more where the errors are larger.

I hope this helps you to have an intuitive understanding of the backpropagation concept.

Any comments are welcome and much appreciated.


Copyright © 2023 Thomas C. Veatch. All rights reserved.
Created: August 26, 2023