- After some background and motivation,
- I give neural networks a fuzzy logic reinterpretation, which seems to be a step in the direction of understanding them better.
- Then I propose a fuzzy-logic-enhanced neural network (FLENN) concept, adding logic nodes to the network. The rules of operation of the various fuzzy logic operators or nodes are cloned from wiser heads than mine, with perhaps one improvement, or at least an opinion, regarding the XOR operation.
- Then on the second page I go through the feedforward application of FLENNs to inputs to generate outputs, and the backpropagation of error derivatives, comparing estimated with target outputs, to improve the weights in the system.

| Set Operation | Logic | Proof |
|---|---|---|
| complement | NOT | What's in the complement of the set is NOT in the set. |
| intersection | AND | What's in the intersection of two sets is in one set AND in the other set. |
| union | OR | What's in the union of two sets is in one set OR the other set OR both. |
| inclusion | IMPLIES | One set being included in another means IF something is in the first, THEN it is in the second. Alternatively: if it's not in the first, it can be in the second or not in the second, but if it's in the first, it has to be in the second. (This finally justifies the awkward truth table for IMPLIES, which has TRUE in the 3rd and 4th rows, where the antecedent is FALSE.) |

Fuzzy logic is where truth has a value in the range [0..1], and where reasoning proceeds through combinations of assertions, just like set membership assertions, but by following fuzzy logic rules of combination for those [0..1] truth values.

With truth values \(P, Q \in [0..1]\):

| Operator | Expression | Fuzzy value |
|---|---|---|
| NOT | ¬P | \(1-P\) |
| AND | P∧Q | \(\min(P,Q)\) |
| OR | P∨Q | \(\max(P,Q)\) |
| XOR | P⊻Q | \(\frac{1}{2}(|P-Q|+1-|P+Q-1|)\)\* |
| IMPLIES | P→Q, i.e. Q ∨ ¬P | \(\max(1-P,Q)\) |

\* XOR as \(|P-Q|\) gets the corners right but has an excessively depressed lane of zeroes down the middle at P=Q. \(1-|P+Q-1|\) also gets the corners right but has a lane of excessively elevated ones from \((1,0)\) to \((0,1)\). I prefer their average, \(\frac{1}{2}(|P-Q|+1-|P+Q-1|)\), which, in the 4 triangular regions bounded by \(P=Q\) and \(P+Q=1\), clockwise from the top, equals \(1-P, 1-Q, P, Q\) respectively.
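These shapes are easy to check numerically. A minimal sketch (the function names are mine, purely for illustration):

```python
# Compare the three fuzzy XOR candidates discussed above.

def xor_a(p, q):        # |P - Q|: corners right, zero lane along P = Q
    return abs(p - q)

def xor_b(p, q):        # 1 - |P + Q - 1|: corners right, ones lane along P + Q = 1
    return 1 - abs(p + q - 1)

def xor_avg(p, q):      # the preferred average of the two
    return 0.5 * (abs(p - q) + 1 - abs(p + q - 1))

# All three agree at the crisp corners:
for p, q in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    assert xor_a(p, q) == xor_b(p, q) == xor_avg(p, q)

# Mid-domain behavior differs, at the center point:
print(xor_a(0.5, 0.5))    # 0.0  (the depressed lane at P = Q)
print(xor_b(0.5, 0.5))    # 1.0  (the elevated lane at P + Q = 1)
print(xor_avg(0.5, 0.5))  # 0.5  (the average splits the difference)
```

The assertions confirm the corner agreement; the prints show why only the average behaves reasonably in the interior.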

Regular logic, which has values 0 and 1 only and nothing in between, is called Crisp, in contrast with Fuzzy Logic, which encodes partial degrees of truth: uncertainty, you might say.

Fuzzy Logic systems are industrially useful in easily written but highly perfected control systems. Trains in Japan and high-end high-speed elevators accelerate so undetectably smoothly that even the water in an aquarium doesn't slop to the side. They are amazing; when you really need perfect and continuous control, you use Fuzzy Logic.

It turns out that Japanese engineers are not put off by the implicit threat to one's masculinity carried by the word "Fuzzy". But in America, what He-Manly-man could ever work on, could ever be proud of, anything Fuzzy? It is too much to bear! But in a post-machismo age, we might also consider getting over it. So please, just consider Fuzzy to be foreign technical terminology. Transcribing into Azerbaijani, where Zadeh grew up, let's read it as "Fazi" Logic.

Fuzzy Logic is the foundation for specifically logical reasoning, not just under uncertainty, with ranges of truth, but under constrained survivability or acceptability, with ranges of tolerance (as Zadeh's Berkeley friend, Hugh Ching, has taught me). In solving a problem with a desired outcome, specifying a range of tolerance, as well as fuzzy combinations of ranges of inputs, gives reliable results, and is also the responsible mode of reasoning. Consider, for example, the fuzzy logic assertion:

Survivable = Tolerable ∧ Possible.

This is self-evidently meaningful and true, without any data or probability distribution estimates. When outside the joint range of the tolerable and the possible, survival goes to zero. This is not a probabilistic statement, it is a logical one.

Probabilities INTEGRATE to 1 over their domain, while fuzzy truth values can BE 1 over unbounded domains.

Fuzzy truth is not about (the integral of) the probability of some crisply true outcome over some crisply specified range, but about statements or outcomes with variable degrees of truth, which may be combinations of other statements or outcomes, themselves of variable degrees of truth. These degrees of truth are specified crisply, and combinations of them are calculated with hard math to bound the outcome space.

In probability space, positive data rules. Probabilities must be tied to reality by assuming a probability distribution and extracting statistics from data under those assumptions, which re-parameterize the distributions in the model. Yes it's quite similar: both are machine learning methods.

But in fuzzy logic little to non-existent data, and close-to-boundary data, are also levers of learning. You don't learn more in fuzzy logic by training on many identical examples. One counterexample in left field can define a region, and the shapes of transitions between logical regions which overlap are the figure of merit in learning. In complex circumstances, fuzzy logic can learn faster, and pay more appropriate attention to what matters for reasoning about the inputs, than what we may call blind probabilistic approaches.

Much machine learning is strongly probabilistic. Hidden Markov Models, my favorite, are trained by maximizing the (log) likelihood of the training data given the model parameters, understanding those parameters as probabilities. Throw the data at the model given initial parameters, putting everything where it is most probable -- fits best -- then count up the counts of what ends up in each of the many states of your HMM model, convert to frequencies (over the sum at each branch point), and those are your new re-estimates of the probability parameters of the model. It's amazing and powerful, and it miraculously self-organizes to find the informational structure of the data, such that the data is more probable ("makes more sense") given the model. Great! But NNs have destroyed HMMs in speech recognition and now LLM tasks, and NNs do not have an immediate probabilistic interpretation.

A normal neural network threshold function like the logistic gives, for any node \(j\), its output \(o_j \in [0..1]\). This allows us to have an interpretation of node **outputs** using fuzzy logic **semantics**: \(o_j\) can be considered as a degree of "yes"-ness or fractional membership value in node \(j\)'s implied fuzzy set or category, exactly like a fuzzy-logic or fuzzy-set membership value, such as 0 (out), 1 (in), or some value in between for partly or uncertainly in or out.

Hence, according to a Fuzzy Logic interpretation of neural networks, each node comes through training to represent some emergent category useful in the minimum-error information flow across the network, given the patterns intrinsic to the training data and given the network's structure.

A Neural Network's structure comprises: its number of layers, number of nodes per layer, and non-zero link weights forming interconnection patterns among the nodes, which in this interpretation are logical information flows.

Is it true that logical combinations and manipulations such as AND, OR, NOT, IMPLIES, and compositions thereof, can be calculated as the informational transformations in the feedforward part of a neural network, and that they can be trained according to the logical patterns in the training data? I conjecture so.

Consider this. Enhance a neural network by adding logical-combination nodes to each layer \(l\), as follows.

For the NOT connective:

- for each layer \(l\),
- for each node \(j\) within layer \(l\),
- add a NOT node \(j_{NOT}\) to the enhanced layer \(l^+\), calculating its output as \(1-o_j\).

For each connective AND, OR, XOR, IMPLIES:

- for each distinct pair of nodes \(j, j'\), both within layer \(l\),
- add a new node \(j\)-\(j'_{CONNECTIVE}\) to the enhanced layer \(l^+\), calculating its output as follows:

\(j\)-\(j'_{AND} = \min(o_j, o_{j'})\)

\(j\)-\(j'_{OR} = \max(o_j, o_{j'})\)

\(j\)-\(j'_{XOR} = \frac{1}{2}(|o_j-o_{j'}|+1-|o_j+o_{j'}-1|)\) (the preferred form from the note above; plain \(|o_j-o_{j'}|\) also gets the corners right)

\(j\)-\(j'_{\rightarrow} = \max(1-o_j, o_{j'})\)

Such an enhanced neural network architecture would obviously be able to carry out (fuzzy) logical reasoning from its inputs (in feedforward operation).

(It would take two layers to apply compositions of logical operators, such as NOT of AND, etc.)
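The enhancement above can be sketched in feedforward terms as follows. The function name and the pairing conventions are mine, one possible choice rather than a fixed specification; note that IMPLIES is not symmetric, so a full enhancement might add both directions per pair.

```python
# A sketch of the proposed layer enhancement (illustrative, not a library API).
from itertools import combinations

def fuzzy_enhance_layer(outputs):
    """Given the outputs o_j of layer l, return the outputs of the
    enhanced layer l+ : the originals, a NOT node per node, and one
    node per connective per distinct pair (j, j')."""
    enhanced = list(outputs)
    enhanced += [1 - o for o in outputs]                      # NOT nodes
    for oj, ok in combinations(outputs, 2):
        enhanced.append(min(oj, ok))                          # AND
        enhanced.append(max(oj, ok))                          # OR
        enhanced.append(0.5 * (abs(oj - ok)
                               + 1 - abs(oj + ok - 1)))       # XOR (averaged form)
        enhanced.append(max(1 - oj, ok))                      # IMPLIES (j -> j' only)
    return enhanced

print(fuzzy_enhance_layer([1.0, 0.0]))
# → [1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0]
# originals, NOTs, then AND, OR, XOR, IMPLIES for the single pair
```

The output list grows as \(2n + 4\binom{n}{2}\) for a layer of \(n\) nodes, which is one reason to populate such nodes sparsely in practice.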

( [¬] p | [¬] ( p ( ∧ | ∨ | XOR | → ) q ) )

Enumerating, this little grammar has 2 + 2*4 = 10 pathways, which generate the following distinct expressions:

p, ¬ p, p ∧ q, ¬ (p ∧ q), p ∨ q, ¬ (p ∨ q), p XOR q, ¬ (p XOR q), p → q, ¬ (p → q),

and, allowing the two inputs to fill the p and q slots in either order, these cover all the input-dependent logical possibilities for one and two inputs.

In the context of an electronic, digital, or fuzzy logic machine, we set it up to take in only one or two bits or logical input values, each taking values (in the range between) 0 and 1, thus forming two or four logically separable value combinations, or in the fuzzy domain, points within the [0,0]..[1,1] square. The machine then maps the one or two input values down to a single bit or fuzzy value, according to a logical specification, formula, or program. This specification of the mapping is itself a choice from among 16 possible patterns in the discrete, digital domain. The table below enumerates all of them: all combinations of 4 bits, or, in the fuzzy case, a mapping over the region which the 2 inputs can range across, the square [0,0]..[1,1].

Those combinations are not the inputs or the outputs: each of the four bits in the BITS column of the table below is by itself a single requested output bit, associated with a specific pair of values of the two input bits (the pair in that column in rows 10 (for q) and 12 (for p)). The row, expression, formula, or program (synonyms here) tells the machine, taking one or two actual input bits, to produce one output bit, just a single 1 or 0; this specified output depends on whether \(p\) or \(q\) is one or zero, or both, or neither, according to the BITS column in the specified row.

Therefore this is a universal logic engine: all possibilities are covered. It is also a convenient and usable logic engine: each of the possible patterns is simply expressible by the above grammar. We can specify any pattern of input/output mappings by writing an expression with at most two connectives, to combine those two inputs (p: #12, q: #10) **whatever** they may be, to yield a single output according to the specified pattern. Observe:

| Decimal | Hex | BITS | Expression |
|---|---|---|---|
| 0 | 0 | 0 0 0 0 | Excluded since independent of inputs |
| 1 | 1 | 0 0 0 1 | ¬ (p ∨ q) |
| 2 | 2 | 0 0 1 0 | ¬ (q → p) |
| 3 | 3 | 0 0 1 1 | ¬ p |
| 4 | 4 | 0 1 0 0 | ¬ (p → q) |
| 5 | 5 | 0 1 0 1 | ¬ q |
| 6 | 6 | 0 1 1 0 | p XOR q |
| 7 | 7 | 0 1 1 1 | ¬ (p ∧ q) |
| 8 | 8 | 1 0 0 0 | p ∧ q |
| 9 | 9 | 1 0 0 1 | ¬ (p XOR q) |
| 10 | A | 1 0 1 0 | q |
| 11 | B | 1 0 1 1 | p → q |
| 12 | C | 1 1 0 0 | p |
| 13 | D | 1 1 0 1 | q → p |
| 14 | E | 1 1 1 0 | p ∨ q |
| 15 | F | 1 1 1 1 | Excluded since independent of inputs |

I did some homework here. If you like you can go through the truth table operations for the given expressions and you will find the meaning of the expression indeed maps all the values p and q could take to the expressed pattern. Because 2*2*2*2 = 16, actually, all possibilities can be enumerated, and the results exhaustively checked, so that the system is conclusively comprehensive and infallible. It seems somewhat minimal too, limiting the connectives to two, with one being a NOT and the other being a single option out of four. Aristotle might propose a different set of connectives to reach the same combinations; that would be fine too. I'll go with this one, since it at least doesn't suck.
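The homework can be mechanized. A brute-force check of the table in crisp logic, under my encoding assumption that the BITS column, read left to right, corresponds to the input pairs (p, q) = (1,1), (1,0), (0,1), (0,0), as rows 12 and 10 indicate:

```python
# Verify each table expression against its BITS pattern, exhaustively.
exprs = {
    0x1: lambda p, q: not (p or q),            # ¬(p ∨ q)
    0x2: lambda p, q: not ((not q) or p),      # ¬(q → p)
    0x3: lambda p, q: not p,
    0x4: lambda p, q: not ((not p) or q),      # ¬(p → q)
    0x5: lambda p, q: not q,
    0x6: lambda p, q: p != q,                  # p XOR q
    0x7: lambda p, q: not (p and q),           # ¬(p ∧ q)
    0x8: lambda p, q: p and q,
    0x9: lambda p, q: p == q,                  # ¬(p XOR q)
    0xA: lambda p, q: q,
    0xB: lambda p, q: (not p) or q,            # p → q
    0xC: lambda p, q: p,
    0xD: lambda p, q: (not q) or p,            # q → p
    0xE: lambda p, q: p or q,
}
for bits, f in exprs.items():
    got = 0
    for p, q in [(1, 1), (1, 0), (0, 1), (0, 0)]:   # BITS order, MSB first
        got = (got << 1) | int(bool(f(p, q)))
    assert got == bits, (hex(bits), hex(got))
print("all 14 input-dependent patterns check out")
```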

Now, if a neural network were enhanced with these five operations NOT, AND, OR, IMPLIES, and XOR, then it would require no more than two layers to calculate any logically possible derived combination of any given (enhanced) inputs.

Thus we have made some progress here. Now we can hope for fuzzy logic to be implemented not just by manual programming of control systems, but in machine learning applications as Fuzzy Logic Enhanced Neural Networks, or FLENNs ("flense": to skin a whale).

I claim that Neural Networks can themselves better be understood if they are not just understood as constituting fuzzy logical categorization systems at each regular node, but also if they are actually enhanced structurally and operationally to include, in the form of fuzzy logic nodes, definite and potentially interpretable fuzzy classifiers and reasoning systems. That is pretty interesting and exciting, and calls out for some experimental work to prove it out and see if this approach also has benefits as to performance.

A NOT might be considered similar to a large negative weight, since that would translate a 1 at the previous level to a large negative contribution to the sum at the next level, which pushes the threshold output toward 0. So far so good. But NOT also translates 0 to 1. Sending something approaching zero from one node to another, multiplying it by a large negative weight (large \(\times\) 0 = 0), yields something approaching \(\theta(0) = 1/2\) (where \(\theta\) is the logistic threshold function), which is not the positive knowledge that the opposite is true, but failure to know anything. So this would be a NOT-1-OR-UNKNOWN, not, indeed, a NOT at all.

In general, the way to do this is to get the weighted sum of two input nodes to be large and positive where you want their combination to produce a 1 on the output, because θ(large)→1, and for the sum to be large and negative (not zero!) where you want the combination to produce a 0 on the output, because θ(large-negative)→0. It's a bit maddening, but we can certainly do it.

Now let's suppose our node j's layer \(l\) is supplemented with a special node C emitting a constant output value \(o_C=1\), with a learnable weight \(w_{C,k}\) to some next-layer node k for which we hope \(o_k = \neg j = 1-o_j\). Learning \(w_{C,k}\) will be no different from learning any other weight, according to the usual gradient descent or successive linear approximation methods, assuming the errors propagating back to C push it in the direction of this pattern of weights, in order to reduce the errors.

Now suppose the system learns a very large negative weight for \(w_{j,k}\) and an approximately half-magnitude, but still large positive weight for \(w_{C,k}\). Let's say \(c\) is large and positive and \(w_{j,k} = -2c\) and \(w_{C,k} = c\).

Then our successor node \(k\), ignoring other inputs, has in the "true" case of \(o_j\)→1, in the limit, the weighted sum \(s_k = 1\times w_{C,k}+1\times w_{j,k} = c - 2c = -c\), which is large and negative; \(\theta(-c)\)→0, which is the output we want: \(o_k\)→0. In the "false" case of \(o_j\)→0, in the limit, k's weighted sum is \(s_k = 1\times w_{C,k}+0\times w_{j,k} = c+0 = c\), which is large and positive, so the output \(o_k = \theta(c)\)→1. This captures some essence of mapping P to ¬P.

It's not our fuzzy logic rule \(\neg P = 1 - P\), but let's call it a neural-network-style NOT rule, which reverses its input and, according to the magnitude of \(c\) makes the transition between 0 and 1 sharper or more gradual, and according to the ratio between \(w_{C,k}\) and \(w_{j,k}\) adjusts how far up from \(o_j=0\) to \(o_j=1\) begins the transition.
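A quick numeric sanity check of this NOT construction, with an illustrative magnitude \(c=10\) (the logistic stands in for \(\theta\)):

```python
import math

def theta(s):                     # logistic threshold function
    return 1 / (1 + math.exp(-s))

c = 10.0                          # "large"; bigger c = sharper transition
w_jk, w_Ck = -2 * c, c            # the weights from the construction; o_C = 1

def not_node(o_j):
    return theta(1 * w_Ck + o_j * w_jk)

print(not_node(1.0))   # ≈ 0   (theta(-c))
print(not_node(0.0))   # ≈ 1   (theta(c))
print(not_node(0.5))   # = 0.5 (transition centered at o_j = 1/2)
```

Varying the ratio of \(w_{C,k}\) to \(w_{j,k}\) shifts where the transition sits, as the text describes.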

We don't know if this is even learnable. If it is learnable, then we still don't know if it is better in experiments than a fuzzy logic rule that doesn't demand that these semi-linear methods learn corner shapes - that would be a nice experiment. But we have shown here that implementing something like negation in neural networks is not impossible, for we have seen it.

In the case of AND, the sums of \((0,0), (1,0), (0,1), (1,1)\) are \(0, 1, 1, 2\) respectively, so the cut point needs to be between 1 and 2, let's say 1.5 or 3/2. So let's use C's output weight to shift the weighted sum of the others down, so that a zero value for the weighted sum lies between the cases we want to separate.

So let's have four nodes, C, P and Q at layer \(l\), and k at layer \(l+1\), with weights \(w_{C,k}, w_{P,k}, w_{Q,k}\). Outputs are \(o_c=1\), \(o_P\in [0..1], o_Q\in [0..1]\). The weighted sum in each case is

\(s_k = o_C w_{C,k} + o_P w_{P,k} + o_Q w_{Q,k}\).

Now to implement AND, set \(w_{P,k} = w_{Q,k} = c\), with \(c\) large and positive (so \(\frac{c}{2}\) is also large and positive), and set \(w_{C,k} = -\frac{3}{2}c\). Then

\(s_k = 1\times w_{C,k} + o_P\times c + o_Q\times c = -\frac{3c}{2} + o_P\times c + o_Q\times c\).

In case P and Q are both 1, \(s_k = -\frac{3c}{2}+c+c = \frac{c}{2}\), which is large and positive, and \(o_k = \theta(\frac{c}{2})\)→1.

In case exactly one of P and Q is 1, \(s_k = -\frac{3c}{2}+c = -\frac{c}{2}\), which is large and negative, so \(o_k = \theta(-\frac{c}{2})\)→0.

Just to confirm, if both P and Q are 0, \(s_k = -\frac{3c}{2}\), which is even more large and negative, so still \(o_k = \theta(-\frac{3c}{2})\)→ 0.

Bingo. We have implemented something like an AND by setting NN weights. It gets the corner cases right; it allows us to ramp the transition more sharply or more slowly at our chosen centerpoint of \(\frac{3}{2}\) according to the size of \(c\); and it even allows us to shift that centerpoint closer to (1,1) or closer to (1,0),(0,1) if we want, so it has some flexibility. On the other hand, this approach shoves the double-nought case double-far into the exponentially-close-to-zero corner, which isn't exactly the equal treatment expected from a true AND concept. In the log domain this would be a disaster, but since we are using addition, as we do right away in going to the next layer of nodes, it may be okay. A loose zero versus a tight zero won't make much difference when scaled and summed at the next level; their contribution will again be close to zero. Is this a problem? Do neural networks ignore FALSE values? A weighted sum including a zero could just as well exclude it. Apparently they do!
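The corner cases above can be checked numerically; here is a sketch with the illustrative magnitude \(c=10\):

```python
import math

def theta(s):                     # logistic threshold function
    return 1 / (1 + math.exp(-s))

c = 10.0
w_C, w_P, w_Q = -1.5 * c, c, c    # the AND construction from the text

def and_node(o_P, o_Q):
    return theta(1 * w_C + o_P * w_P + o_Q * w_Q)

for P, Q in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    print((P, Q), and_node(P, Q))
# (1,1): theta(c/2)   -> close to 1
# (1,0): theta(-c/2)  -> close to 0
# (0,1): theta(-c/2)  -> close to 0
# (0,0): theta(-3c/2) -> the "double-far" near-zero corner
```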

This is worth deeper thought. All the weights are importances of their inputs, even negative importances, but a zero output value from a previous layer says, please ignore me. Evidently Neural Networks implement a kind of one-sided logic, only paying attention to positive examples, and if a classifier node says this input is OUTSIDE my classification zone (by having an output \(o=0\)), then the rest of the neural network immediately goes about completely ignoring that important bit of information. Our NOT trick above, with extra constant nodes and weights, or a Fuzzy-Logic-Enhanced approach, also with an extra node and weight for each combination of inputs, might be quite the helpful enhancement for a more intelligent neural network.

What about OR?

Now we want to shift the weighted sum of \(o_P,o_Q\) down only slightly, so that it comes out above zero whenever either \(o_P\) or \(o_Q\) approaches one:

Set \(w_{C,k} = -\frac{c}{2}\), retaining \(w_{P,k} = w_{Q,k} = c\), \(c\) large and positive. Then

| Where | Sum | | Output |
|---|---|---|---|
| P=1, Q=1 | \(s_k = -\frac{c}{2} + c + c\) | \(=\frac{3c}{2}\) | \(o_k =\theta(\frac{3c}{2})\)→1 |
| P=1, Q=0 | \(s_k = -\frac{c}{2} + c\) | \(=\frac{c}{2}\) | \(o_k =\theta(\frac{c}{2})\)→1 |
| P=0, Q=1 | \(s_k = -\frac{c}{2} + c\) | \(=\frac{c}{2}\) | \(o_k =\theta(\frac{c}{2})\)→1 |
| P=0, Q=0 | \(s_k = -\frac{c}{2}\) | \(=-\frac{c}{2}\) | \(o_k =\theta(-\frac{c}{2})\)→0 |

Now what about IMPLIES (and XOR)?

Here's a false path. A positive weight from a node at one level to a successor node at the next level, might be taken to mean something like IMPLIES.

There are the four cases:

| P | Q | P→Q |
|---|---|---|
| T | T | T |
| T | F | F |
| F | T | T |
| F | F | T |

None of these involves two nodes; they all involve three: the two inputs, and the one output which is the value of P→Q given the two input values for P and for Q. So successive node activation in neural nets is not a model of logical implication.

Think about it.

IF the predecessor node is substantially true, THEN the successor node will be substantially influenced toward being true by a large positive weight between them. So far so good. True leads to true, which sounds a lot like logical implication.

But IF the predecessor node is substantially false, its output is close to zero, THEN the successor node will basically ignore the predecessor, since zero times whatever weight links them is still close to zero, so that predecessor will add little or nothing to the sum of inputs that the successor will spark off on if the sum is large enough. It doesn't drive the successor to false but to ignorance: θ(0)= 0.50, which means Don't Know, on the scale of [0..1] mapping to [No .. Yes].

A forward link is not a logical IMPLIES relationship.

So let's try again. Another way to ask the question is, Can plain Neural Networks encode subset relationships?

The logical IMPLIES operation is really an encoding for a subset relationship. For example, since dogs are a subset of mammals, then dog(x) → mammal(x); and dog(x) → mammal(x) is always true in a world in which dogs are (a subset of) mammals. So subset and → are essentially equivalent.

Logical IMPLIES wants the output to be true when the (first) input is false; and that makes more sense to me when I think of it as a subset relationship: P → Q means P is a subset of Q. A proposition P(x)→Q(x) is false only when the thing x is claimed to be in the subset, but it is NOT in the superset. Did you process that? I wrote it, and understood it when I composed the sentence, but by the time I wrote it down, my eyes were crossed. Re-reading after a nap, it makes sense again.

Yes. P IMPLIES Q can be implemented as Q OR NOT P, which in our concept requires a second layer. The first layer feeds forward from \(l\) to \(l+1\) to convert a value \(P\in [0..1]\) to \(\neg P \in [1..0]\), and the second layer, from \(l+1\) to \(l+2\), converts \(\neg P \in [1..0]\), along with \(Q\in [0..1]\), to something like \(\neg P \vee Q \equiv P\)→\(Q\).
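A sketch of that two-layer composition, reusing the NOT and OR weight constructions from above (with illustrative \(c=10\); exact at the corners only, approximate in the fuzzy interior):

```python
import math

def theta(s):                      # logistic threshold function
    return 1 / (1 + math.exp(-s))

c = 10.0

def not_layer(P):                  # layer l -> l+1: w_C = c, w_P = -2c
    return theta(c - 2 * c * P)

def or_layer(a, b):                # layer l+1 -> l+2: w_C = -c/2, inputs weighted c
    return theta(-0.5 * c + c * a + c * b)

def implies(P, Q):                 # ≈ ¬P ∨ Q, across two layers
    return or_layer(not_layer(P), Q)

for P, Q in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    print((P, Q), implies(P, Q))
# comes out ≈ 1, ≈ 0, ≈ 1, ≈ 1 at the corners: the IMPLIES truth table
```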

Similarly, P XOR Q can be implemented various ways. In Fuzzy Logic I like the average of |P-Q| and 1-|P+Q-1|, which is nice and linear, and also fast in integer CPUs: a few adds, two sign-drops, and a bit-shift (to divide by 2). Could we do all that, and subtract 1/2, through the veil of intervening weights, in constant-enhanced neural networks (CENNs, defined below)? Maybe; I haven't worked it out.

But certainly, in constant-enhanced Neural Nets, we can use \((P\vee Q) \wedge \neg(P\wedge Q)\) to build that in, across 3 layers.
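That three-layer build can also be sketched, stacking the NOT, AND, and OR weight constructions (again with illustrative \(c=10\); correct at the corners, approximate in between):

```python
import math

def theta(s):                      # logistic threshold function
    return 1 / (1 + math.exp(-s))

c = 10.0

def AND(a, b): return theta(-1.5 * c + c * a + c * b)
def OR(a, b):  return theta(-0.5 * c + c * a + c * b)
def NOT(a):    return theta(c - 2 * c * a)

def xor(P, Q):                     # (P ∨ Q) ∧ ¬(P ∧ Q), across 3 layers:
    return AND(OR(P, Q),           # layer 1 computes P∨Q and P∧Q,
               NOT(AND(P, Q)))     # layer 2 negates the AND, layer 3 combines

for P, Q in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    print((P, Q), xor(P, Q))
# comes out ≈ 0, ≈ 1, ≈ 1, ≈ 0 at the corners: the XOR truth table
```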

So to conclude: No, we don't **need** Fuzzy Logic Enhancements for
there to exist the possibility of logical operations in a Neural
Network.

Let's call them CENNs, Constant Enhanced Neural Networks. To a vanilla neural network, we add one constant-output node per layer, and a row in the weight matrix between that layer and the next, sparsely or fully populated, carrying weights to the successors of any node or pair of nodes in the layer that might, with learning, be usefully subjected to a logical operation.

We don't know if this is even learnable. If it is learnable, then we still don't know if it is better in experiments than a fuzzy logic rule that doesn't demand that these semi-linear methods learn corner shapes - that would be a nice experiment. But we have shown here that implementing something like negation, conjunction, disjunction, implication, and even the exclusive-or, within nearly-vanilla Neural Networks is not impossible, for we have seen it.

In this dimension of interpretability, it is the Fuzzy-Logic Enhancements which win over CENNs. In FLENNs, the weight from a logical-combination node to anything in the succeeding layer, if it develops a large value, directly asserts the fuzzy-logical meaning that the combination node implements, whether that's an AND, OR, NOT, IMPLIES, or XOR. We labelled the node when we made it, and we set up its deterministic, non-learnable formula to combine its specific one or two inputs. So we know. What the system learns is whether to put any weight on the output of that logical combination, whether it is usefully meaningful as the layers feed forward to reduce the total errors. It doesn't construct logicality by learning weights which combine in special ways to make logical inferences, as in the CENN approach. No, FLENNs detect the value of logicality, by putting a substantial weight onto a designer-defined logical combination.

In contrast, how many rounds must CENN training run for, so that useful logical combinations can be trained up? It seems a lot more adjusting would be needed, to find values for \(c\), to balance them between the inputs P and Q, to somehow get the Constant node's output to be \(c\) times the right scale factor for AND or OR or NOT. Weight space is large, and these are small subsets that must be found, and it is not clear that getting closer to useful values will produce much in the way of error reduction or gradient, especially considering possible countervailing effects in the data. We may only hope that gradient descent or successive linear approximation will encounter them in any finite number of training generations.

Experiments may be expected to confirm or dash such hopes.

Weights very close to zero could have a tremendous logical significance in machine learning (outside NNs). If an ML method were to combine information across a graph by using multiplication, which is log-domain addition, then tiny fractions, which map in the log domain to large negative log values, have outsize effects. However, in Neural Networks, a weight close to zero, multiplied with even the largest of input values (≤1), reduces the influence of the predecessor's output to nothing. The weighted output has no influence on the sum of the next layer's inputs, because when you add zero to a sum, you don't change the sum. It is as if this node wasn't even there, for that successor. Learning a near-zero weight means learning to ignore an input.

It's not a NOT OR AND IMPLIES or XOR, it's an IGNORE.

Inverses are not importances. They don't allow us to invert our calculations, as on a log scale they might be able to do.

We reason from large weights to importance of the predecessor's
classificatory information in the positive detection of this
node's class. We assume that a linear combination of
predecessors will get us the right result. This means a
lollapalooza effect, where many inputs pointing in the same
direction are added together to get a super-strong inference of
that class. It also means fungibility, which is the effect that
some of This can be understood as counting the same some of That
Other, in the ratio of their weights, so that the sum can be the
same whether it got its input from here or there. It doesn't
matter which one the news came from, they count the same.
Further, we allow large **negative** weights to cancel out the
positive information gleaned other nodes, but again by sum, not
by product. (If by product, then small fractions would have the
effect of large negatives in this additive domain.) Here even
negatives are fungible: they apply to the same inferred
conclusion which is simply the resulting sum at the end of the
summation of weighted inputs, and less of some positives is taken
to count as the same as more of this negative. That's simply how
\(\large +\) works, or here more specifically,
\(s_j=\sum_i o_i w_{i,j}\).

On the other hand, when we reason about weights that are inverses
of large numbers, that is, close-to-zero weights, it says that
this is **not important to that**. When This has an output
that is close to one, and the weight from This node to That node
multiplies by a weight that is close to zero, it results in a
lack of influence on the successor. Plenty of zeroes all through
the sum will leave \(s_j\) still zero-ish, and θ(\(s_j\))
falls in the ignorant, know-nothing middle at 1/2.

So weights close to zero, fractions, inverses of large weights,
should be thought of as **unimportances**, as ignoreabilities,
not as positive information that something is not there, nor as
the certainty even that the following is unknown. Because its
zero-ness gets immediately lost if any other feeder of the
downstream process sends forth a weighted output greater than
zero. Zeroes, zero weights, zero outputs, are simply ignored. Nothing
detects their presence, without specialty enhancements, as we have
discussed.

If weights get close to zero, maybe at some point in training they should be floored to zero, since they are that unimportant; and then the system can be simpler and faster and hardly any less effective. We let them hang around only because they might pick up some utility later on in learning, and grow to be important again. When done learning, we could probably zero them out.

In my 1987 unpublished independent study of hidden Markov models, I re-estimated model parameters of randomly-initialized, N-state, fully-connected HMMs to best fit a training data set of raw text data. With N=2, the probabilities of the two states for each letter amazingly picked out the vowels as high-probability on one state, and the consonants on the other, as a generalization. The categories were not perfect; they seemed to do a best-fit job of making the most of the statistical patterns in the input data, and it was rather along the lines of a miracle that something very close to natural phonotactic categories emerged automatically from such a general statistical leverage optimizer. With N=3, punctuation dominated one of the states. With N=4 and N=5 there were syllable-position-sensitive categorizations, like /s/ is initial in syllable-onsets, and final in syllable codas, etc. At higher values of N, it didn't seem to make a lot of sense what the emerging categories might be. The way to think about it was that each state optimized for simply maximizing the joint probability of the whole observation sequence, irrespective of any preferred linguistic categories. The data rules; preconceived linguistic categories be damned.

This experience translates quite analogously to the neural network case. Each node can be considered a fuzzy-logic category detector, but the meaning of the emergent categories which the training process gradually creates is anyone's guess. A not-exactly-probabilistic combination of multiple known categories, or a not-exactly-probabilistic subdivision of a single known category according to how it best fits the data, are just a couple of ways that backpropagation might generate mysterious or incomprehensible, yet useful, emergent categories, along with their own special logic (linear transformation under non-linear thresholding) that can combine and manipulate them arbitrarily.

However, enhanced by Fuzzy-Logic nodes representing Fuzzy-Logic combinations of pairs of nodes at a given layer, computational neural networks can be made capable of learning interpretable and hopefully effective and useful, economically represented and efficiently-calculated, Fuzzy-Logical classifications and transformations at key places in their information flow from inputs to estimated outputs. A weights editor will be able to tell you what logic is being done in such an enhanced system, and you may then hope to see inside the mind of its artificial and effective intelligence. Then we can discover the logical truth, the insight-bearing informational structure of whatever problems we are studying.

In conclusion, Fuzzy Logic can enhance neural networks by providing more interpretability and easier training of specifically logical combinations. But Neural Networks also enhance Fuzzy Logic, by providing machine learning methods to make Fuzzy Logic systems learnable from data, including big data. So although FL can be considered an enhancement to NNs, it goes the other way too: NNs offer straightforward training to the FL concept. For example, perhaps a single training example of a robot arm dragged manually through its working trajectory can be captured and converted into an FL control system. Or we may track and collect mechanical skills of animals or humans, and optimize FLENNs to train them to implement that skill set. One or a few examples may be enough, with judiciously chosen mappings to time and space and control systems. Multi-layer connections may be used: one for the control pattern, another to self-contextualize, to learn one's body and one's environment. A perception/action loop must find the threads of the learned concepts in the percepts, and generate coordinated, modelled response concepts in its provided body.