Neural Networks as Fuzzy Logic Engines

which would be amazing



| "Fuzzy" | Introduction | Fuzzy Logic | Fuzzy ≠ Probabilistic | FL Enhanced NNs | Sufficiency | Fuzzy Already | Open Questions | Near Zero | Conclusion |

What's in this Name?

The name Fuzzy unfortunately induces automatic disrespect for these concepts by machismo-blinded Americans. Teddy bears are fuzzy, not big manly science and engineering concepts. If that's the case for you, please replace "Fuzzy" by "Fractional Truth", and in acronyms "F" by "FT" or simply "F". If you prefer to read FL as "Fractional Truth Logic", everything will conveniently remain the same.

Introduction

Neural networks are normally taken as black boxes. Somehow they do pretty well at predicting the outputs given the inputs. It's a mystery, though, what the nodes mean and what the weights mean: these questions are not much studied or discussed. Post-training empirical studies of the meanings of nodes in a neural network are hardly to be found (are there any?). In this page and its associated model-training discussion ..., it seems likely to me that the FLENN (Fuzzy Logic Enhanced Neural Network) concept, developed below, offers a much greater possibility of actually understanding the emergent intelligences that neural network training can create. Not least, comparing a regular neural network's approximations of logical operations with the built-in logical operations of a FLENN suggests that a ton of random, frankly unstable, noise-influenced exploration would be required to give a plain NN much logic, while a FLENN should capture logical essences quite quickly and robustly, without endlessly wandering off, losing track, and getting confused by crossing signals. I'm hopeful that this work may prove to be a contribution to the machine learning and AI literature, and I solicit your review, criticisms, and comments. Thank you in advance.

Fuzzy Logic

The great Lotfi Zadeh's seminal paper, "Fuzzy Sets" (1965), pointed out that fuzzy set operations can be defined for set complement, inclusion, union, and intersection. Notice that these set operations correspond exactly to the logical operations NOT, IMPLIES, OR, and AND, respectively: they have the same meaning.

| Set Operation | Logic | Proof |
| complement | NOT | What's in the complement of the set is NOT in the set. |
| intersection | AND | What's in the intersection of two sets is in one set AND in the other set. |
| union | OR | What's in the union of two sets is in one set OR the other set OR both. |
| inclusion | IMPLIES | One set being included in another means IF something is in the first, THEN it is in the second; alternatively, if it's not in the first, it can be in the second or not in the second, but if it's in the first, it has to be in the second. (This finally justifies the awkward truth table for IMPLIES, which has TRUE in the 3rd and 4th rows, where the antecedent is FALSE.) |

Fuzzy logic is where truth has a value in the range [0..1], and where reasoning proceeds through combinations of assertions, just like set membership assertions, but by following fuzzy logic rules of combination for those [0..1] truth values.

| Operation | Notation | Fuzzy truth value |
| P | P | \(P \in [0..1]\) |
| Q | Q | \(Q \in [0..1]\) |
| NOT | ¬P | \(1-P\) |
| AND | P∧Q | \(\min(P,Q)\) |
| OR | P∨Q | \(\max(P,Q)\) |
| XOR | P⊻Q | \(\frac{1}{2}(|P-Q|+1-|P+Q-1|)\)* |
| IMPLIES | P→Q, i.e. Q ∨ ¬P | \(\max(1-P,Q)\) |
* XOR as \(|P-Q|\) gets the corners right but has an excessively depressed lane of zeroes down the middle at P=Q.
    \(1-|P+Q-1|\) also gets the corners right but has a lane of excessively elevated ones from \((1,0)\) to \((0,1)\).
    I prefer their average, \(\frac{1}{2}(|P-Q|+1-|P+Q-1|)\), which in the 4 triangular regions
    bounded by \(P=Q\) and \(P+Q=1\), clockwise from the top, are respectively \(1-P, 1-Q, P, Q\).
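
To make these rules concrete, here is a minimal sketch in plain Python (the function names are mine, not an established API), using the averaged form of XOR preferred above:

```python
# Fuzzy connectives over truth values in [0, 1].
# Minimal illustrative sketch; the function names are mine.

def not_f(p):
    return 1.0 - p                     # NOT: 1 - P

def and_f(p, q):
    return min(p, q)                   # AND: min(P, Q)

def or_f(p, q):
    return max(p, q)                   # OR: max(P, Q)

def implies_f(p, q):
    return max(1.0 - p, q)             # IMPLIES: max(1 - P, Q), i.e. Q OR NOT P

def xor_f(p, q):
    # Average of |P - Q| and 1 - |P + Q - 1|, as preferred in the note above.
    return 0.5 * (abs(p - q) + 1.0 - abs(p + q - 1.0))

# Crisp corners agree with ordinary Boolean logic:
assert (and_f(1, 0), or_f(1, 0), xor_f(1, 1), implies_f(1, 0)) == (0, 1, 0, 0)
# Partial truths combine smoothly:
print(and_f(0.7, 0.4), implies_f(0.9, 0.3), xor_f(0.5, 0.5))   # 0.4 0.3 0.5
```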

Regular logic, which has values 0 and 1 only and nothing in between, is called Crisp, in contrast with Fuzzy Logic, which encodes partial degrees of truth, or, you might say, uncertainty.

Fuzzy Logic systems are industrially useful in easily written but highly perfected control systems. Trains in Japan and high-end high-speed elevators accelerate so undetectably smoothly that even the water in an aquarium doesn't slop to the side. They are amazing; and when you really need perfect and continuous control, you use Fuzzy Logic.

It turns out that Japanese engineers are not put off by the implicit threat to one's masculinity carried by the word "Fuzzy". But in America, what He-Manly-man could ever work on, could ever be proud of, anything Fuzzy? It is too much to bear! But in a post-machismo age, we might also consider getting over it. Perhaps we might consider Fuzzy to be foreign technical terminology. Transcribing into Azerbaijani, where Zadeh grew up, let's read it as "Fazi" Logic. Or as I suggested above, use the same acronym, FL, but the replacement name, "Fractional-truth Logic".

Fuzzy Logic is the foundation for specifically logical reasoning, not just under uncertainty, with ranges of truth, but under constrained survivability or acceptability, with ranges of tolerance (as Zadeh's Berkeley friend, Hugh Ching, has taught me). In solving a problem with a desired outcome, specifying a range of tolerance, along with fuzzy combinations of ranges of inputs, gives reliable results, and it is also the responsible mode of reasoning. Consider, for example, the fuzzy logic assertion:

Survivable = Tolerable ∧ Possible.

This is self-evidently meaningful and true, without any data or probability distribution estimates. When outside the joint range of the tolerable and the possible, survival goes to zero. This is not a probabilistic statement, it is a logical one.

Fuzzy ≠ Probabilistic

Fuzzy truth is different from probabilistic truth (which also ranges from 0 to 1), and from entropy or information-theoretic truth, which assert probabilities or likelihoods over unbounded domains.

Probabilities INTEGRATE to 1 over their domain, while fuzzy truth values can BE 1 over unbounded domains.

Fuzzy truth is not about (the integral of) the probability of some crisply true outcome over some crisply specified range, but about statements or outcomes with variable degrees of truth, which may be combinations of other statements or outcomes, themselves of variable degrees of truth. These degrees of truth are specified crisply, and combinations of them are calculated with hard math to bound the outcome space.

In probability space, positive data rules. Probabilities must be tied to reality by assuming a probability distribution and extracting statistics from data under those assumptions, which re-parameterize the distributions in the model. Yes it's quite similar: both are machine learning methods.

But in fuzzy logic little to non-existent data, and close-to-boundary data, are also levers of learning. You don't learn more in fuzzy logic by training on many identical examples. One counterexample in left field can define a region, and the shapes of transitions between logical regions which overlap are the figure of merit in learning. In complex circumstances, fuzzy logic can learn faster, and pay more appropriate attention to what matters for reasoning about the inputs, than what we may call blind probabilistic approaches.

Much machine learning is strongly probabilistic. Hidden Markov Models, my favorite, are trained by maximizing the (log) likelihood of the training data given the model parameters, understanding those parameters as probabilities. Throw the data at the model given initial parameters, putting everything where it is most probable -- fits best -- then count up the counts of what ends up in each of the many states of your HMM model, convert to frequencies (over the sum at each branch point), and those are your new re-estimates of the probability parameters of the model. It's amazing and powerful, and it miraculously self-organizes to find the informational structure of the data, such that the data is more probable ("makes more sense") given the model. Great! But NNs have destroyed HMMs in speech recognition and now LLM tasks, and NNs do not have an immediate probabilistic interpretation.

FLENN

Can Neural Networks be considered as learning Fuzzy Logic Engines? What an appealing idea! Consider how close we are already!

A normal neural network threshold function like the logistic gives, for any node \(j\), an output \(o_j \in [0..1]\). This allows us to interpret node outputs using fuzzy logic semantics: \(o_j\) can be considered a degree of "yes"-ness or a fractional membership value in node \(j\)'s implied fuzzy set or category, exactly like a fuzzy-logic or fuzzy-set membership value, such as 0 (out), 1 (in), or some value in between for partly or uncertainly in or out.

Hence, according to a Fuzzy Logic interpretation of neural networks, each node comes through training to represent some emergent category useful in the minimum-error information flow across the network, given the patterns intrinsic to the training data and given the network's structure.

A Neural Network's structure comprises: its number of layers, number of nodes per layer, and non-zero link weights forming interconnection patterns among the nodes, which in this interpretation are logical information flows.

Is it true that logical combinations and manipulations such as AND, OR, NOT, IMPLIES, and compositions thereof, can be calculated as the informational transformations in the feedforward part of a neural network, and that they can be trained according to the logical patterns in the training data? I conjecture so.

Consider this. Enhance a neural network by adding logical-combination nodes to each layer \(l\), as follows:

For the NOT connective,
    for each layer, \(l\),
       for each node, \(j\), within layer \(l\),
          add a NOT node \(j_{NOT}\) to the enhanced layer \(l^+\), calculating its output as \(1-o_j\).

For each connective AND, OR, XOR, IMPLIES:
   for each distinct pair of nodes \(j, j'\) both within layer \(l\),
       add a new node \(j\)-\(j'\)CONNECTIVE to the enhanced layer \(l^+\), calculating its output as follows:
             \(j\)-\(j'\)AND = MIN(\(o_j, o_{j'}\))
             \(j\)-\(j'\)OR   = MAX(\(o_j, o_{j'}\))
             \(j\)-\(j'\)XOR = \(|o_j - o_{j'}|\)
             \(j\)-\(j'\)IMPLIES = MAX(\(1-o_j, o_{j'}\))

Such an enhanced neural network architecture would obviously be able to carry out (fuzzy) logical reasoning from its inputs (in feedforward operation).

(It would take two layers to apply multiple logical operators such as NOT AND, etc.)
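
As an illustrative sketch of this enhancement (NumPy; the function name, the use of unordered pairs, and the simple \(|o_j - o_{j'}|\) form of XOR are my own choices, not a definitive implementation), a layer's activation vector can be extended with NOT and pairwise connective nodes before feeding the next layer's weights:

```python
import numpy as np
from itertools import combinations

def enhance_layer(o):
    """Append fuzzy-logic nodes to a layer's activations o (each in [0, 1]):
    one NOT node per node, plus AND, OR, XOR, and IMPLIES nodes for each
    distinct pair (j, j').  Sketch only; a trained network would learn
    weights from these extra nodes to the next layer."""
    o = np.asarray(o, dtype=float)
    extras = [1.0 - o]                                    # the NOT nodes
    for j, jp in combinations(range(len(o)), 2):
        a, b = o[j], o[jp]
        extras.append([min(a, b),                         # j-j' AND
                       max(a, b),                         # j-j' OR
                       abs(a - b),                        # j-j' XOR (simple form)
                       max(1.0 - a, b)])                  # j-j' IMPLIES
    return np.concatenate([o] + [np.ravel(e) for e in extras])

# A 3-node layer grows to 3 + 3 NOTs + C(3,2)*4 = 18 nodes:
print(enhance_layer([0.9, 0.2, 0.6]).shape)               # (18,)
```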

Logical Sufficiency of FL

These logical connectives were chosen so they can express all possible combinations of two inputs p, q, via the following logical grammar (where () groups, | separates choices, and [] indicates optional inclusion or deletion):

  (
     [¬] p |
     [¬] p ( ∧ | ∨ | XOR | → ) q
  )

Enumerating, this little grammar has 2 + 2*4 = 10 pathways, which generate the following distinct expressions:

p, ¬ p, p ∧ q, ¬ p ∧ q, p ∨ q, ¬ p ∨ q, p XOR q, ¬ p XOR q, p → q, ¬ p → q,

which are also all the logical possibilities for one and two inputs.

In the context of an electronic, digital, or fuzzy logic machine, we set it up to take in only one or two bits or logical input values, each possibly taking the values (or, in the fuzzy case, the range between) 0 and 1, thus forming two or four logically separable value combinations -- or, in the fuzzy domain, points within the [0,0]..[1,1] square. The machine then maps the one or two input values down to a single bit or fuzzy value, according to a logical specification, formula, or program. This specification of the mapping is itself a choice from among 16 possible patterns -- in the discrete, digital domain. The table below of scenarios, or list of patterns, formulas, or programs, enumerates all of them: all combinations of 4 bits, or a region of space corresponding to the ranges which the 2 inputs can take, the square [0,0]..[1,1].

Those combinations are not the inputs or the outputs: each of the four bits in the BITS column of the table below is by itself a single requested output bit, associated with a specific pair of values of the two input bits (the pair found at that bit's position in rows 10 (for q) and 12 (for p)). The row, expression, formula, or program (synonyms here) tells the machine, taking one or two actual input bits, to produce one output bit, just a single 1 or 0; but this specified output depends on whether \(p\) or \(q\) is one or zero, or both, or neither, according to the BITS column in the specified row.

Therefore this is a universal logic engine: all possibilities are covered. It is also a convenient and usable logic engine: each of the possible patterns is simply expressible by the above grammar. We can specify any pattern of input/output mappings by writing an expression with at most two connectives to combine those two inputs (p: #12, q: #10), whatever they may be, to yield a single output according to the specified pattern. Observe:

| Decimal | Hex | BITS | Expression |
| 0 | 0 | 0 0 0 0 | Excluded since independent of inputs |
| 1 | 1 | 0 0 0 1 | ¬ (p ∨ q) |
| 2 | 2 | 0 0 1 0 | ¬ (q → p) |
| 3 | 3 | 0 0 1 1 | ¬ p |
| 4 | 4 | 0 1 0 0 | ¬ (p → q) |
| 5 | 5 | 0 1 0 1 | ¬ q |
| 6 | 6 | 0 1 1 0 | p XOR q |
| 7 | 7 | 0 1 1 1 | ¬ (p ∧ q) |
| 8 | 8 | 1 0 0 0 | p ∧ q |
| 9 | 9 | 1 0 0 1 | ¬ (p XOR q) |
| 10 | A | 1 0 1 0 | q |
| 11 | B | 1 0 1 1 | p → q |
| 12 | C | 1 1 0 0 | p |
| 13 | D | 1 1 0 1 | q → p |
| 14 | E | 1 1 1 0 | p ∨ q |
| 15 | F | 1 1 1 1 | Excluded since independent of inputs |

(The six reasons there were 10 expressions above instead of 16: (1) p and q, (2) ¬p and ¬q, and (3) p→q and q→p, along with (4) their negations, were not listed separately, plus (5), (6) the exclusions of rows 0 and 15 mentioned in the table.)

I did some homework here. If you like, you can work through the truth-table operations for the given expressions, and you will find that each expression indeed maps all the values p and q could take onto the expressed pattern. Because 2*2*2*2 = 16, all possibilities can be enumerated and the results exhaustively checked, so the system is conclusively comprehensive and infallible. It seems somewhat minimal too, limiting the connectives to two, with one being a NOT and the other being a single option out of four. Aristotle might propose a different set of connectives to reach the same combinations; that would be fine too. I'll go with this one, since it at least doesn't suck.
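
If you would rather have the checking done mechanically, here is a small verification loop (plain Python; the lambda encoding of each expression is mine) that evaluates every row's expression over the four crisp input pairs and compares it against its BITS column:

```python
# Check the table: each expression, evaluated over the crisp input pairs
# (p, q) = (1,1), (1,0), (0,1), (0,0), should reproduce its BITS pattern.
# The lambda encoding of each expression is mine.

rows = {
    1:  ("0001", lambda p, q: not (p or q)),           # ¬(p ∨ q)
    2:  ("0010", lambda p, q: not ((not q) or p)),     # ¬(q → p)
    3:  ("0011", lambda p, q: not p),                  # ¬p
    4:  ("0100", lambda p, q: not ((not p) or q)),     # ¬(p → q)
    5:  ("0101", lambda p, q: not q),                  # ¬q
    6:  ("0110", lambda p, q: p != q),                 # p XOR q
    7:  ("0111", lambda p, q: not (p and q)),          # ¬(p ∧ q)
    8:  ("1000", lambda p, q: p and q),                # p ∧ q
    9:  ("1001", lambda p, q: p == q),                 # ¬(p XOR q)
    10: ("1010", lambda p, q: q),                      # q
    11: ("1011", lambda p, q: (not p) or q),           # p → q
    12: ("1100", lambda p, q: p),                      # p
    13: ("1101", lambda p, q: (not q) or p),           # q → p
    14: ("1110", lambda p, q: p or q),                 # p ∨ q
}

pairs = [(1, 1), (1, 0), (0, 1), (0, 0)]
for n, (bits, expr) in rows.items():
    computed = "".join(str(int(bool(expr(p, q)))) for p, q in pairs)
    assert computed == bits, (n, computed, bits)
print("All 14 rows reproduce their BITS patterns.")
```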

Now, if a neural network were enhanced with these five operations NOT, AND, OR, IMPLIES, and XOR, then it would require no more than two layers to calculate any logically possible derived combination of any given (enhanced) inputs.

Thus we have made some progress here. Now we can hope for fuzzy logic to be implemented not just by manual programming of control systems, but in machine learning applications as Fuzzy Logic Enhanced Neural Networks, or FLENNs ("flense": to skin a whale).

I claim that Neural Networks can be better understood not just by interpreting each regular node as a fuzzy logical categorization system, but also by actually enhancing them structurally and operationally to include, in the form of fuzzy logic nodes, definite and potentially interpretable fuzzy classifiers and reasoning systems. That is pretty interesting and exciting, and it calls out for some experimental work to prove it out and see if this approach also has benefits as to performance.

Pre-existing FL features of NNs

A critic might argue that neural networks can already emulate some or all of this logicality, so why use FLENNs? Let's take a look.

NOT

Can we engineer a way to get \(1-o_j\) out of a node in the regular course of neural networks?

A NOT might be considered similar to a large negative weight, since that would translate a 1 at the previous level into a large negative contribution to the sum at the next level, which pushes the threshold output toward 0. So far so good. But NOT also translates 0 to 1. Sending something approaching zero from one node to another and multiplying it by a large negative weight (large \(\times\) 0 = 0) yields something approaching θ(0) = 1/2, which is not the positive knowledge that the opposite is true, but failure to know anything. So this would be a NOT-1-OR-UNKNOWN, not, indeed, a NOT at all.

In general, the way to do this is to get the weighted sum of two input nodes to be large and positive where you want their combination to produce a 1 on the output, because θ(large)→1, and for the sum to be large and negative (not zero!) where you want the combination to produce a 0 on the output, because θ(large-negative)→0. It's a bit maddening, but we can certainly do it.

Now let's suppose our node j's layer \(l\) is supplemented with a special node C emitting a constant output value \(o_C=1\), with a learnable weight \(w_{C,k}\) to some next-layer node k for which we hope \(o_k = \neg j = 1-o_j\). Learning \(w_{C,k}\) will be no different from learning any other weight, according to the usual gradient descent or successive linear approximation methods, assuming the errors propagating back to C push it in the direction of this pattern of weights, in order to reduce the errors.

Now suppose the system learns a very large negative weight for \(w_{j,k}\) and an approximately half-magnitude, but still large positive weight for \(w_{C,k}\). Let's say \(c\) is large and positive and \(w_{j,k} = -2c\) and \(w_{C,k} = c\).

Then our successor node \(k\), ignoring other inputs, has in the "true" case of \(o_j\)→1, in the limit, the weighted sum \(s_k = 1\times w_{C,k}+1\times w_{j,k} = c - 2c = -c\), which is large and negative; θ(-c)→0, which is the output we want, \(o_k\)→0. In the "false" case with \(o_j\)→0, in the limit, k's weighted sum is \(s_k = 1\times w_{C,k}+0\times w_{j,k} = c+0 = c\), which is large and positive, so the output \(\theta(c)=o_k\)→1. This captures some essence of mapping P to ¬P.

It's not our fuzzy logic rule \(\neg P = 1 - P\), but let's call it a neural-network-style NOT rule, which reverses its input and, according to the magnitude of \(c\), makes the transition between 0 and 1 sharper or more gradual, and, according to the ratio between \(w_{C,k}\) and \(w_{j,k}\), adjusts where between \(o_j=0\) and \(o_j=1\) the transition begins.
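
A tiny numeric check of this construction (NumPy; the choice \(c=10\) and the helper are my own, illustrative only): with \(w_{j,k} = -2c\) and \(w_{C,k} = c\), the logistic of the weighted sum reverses its input at the corners, though in between it behaves like a sharpened switch rather than the linear fuzzy \(1-P\).

```python
import numpy as np

def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))

c = 10.0                          # larger c -> sharper transition
w_Ck, w_jk = c, -2.0 * c          # constant node C emits 1; node j emits o_j

for o_j in [0.0, 0.25, 0.5, 0.75, 1.0]:
    s_k = 1.0 * w_Ck + o_j * w_jk            # s_k = c - 2c * o_j
    print(o_j, round(float(logistic(s_k)), 3))
# -> 0.0 1.0, 0.25 0.993, 0.5 0.5, 0.75 0.007, 1.0 0.0
# Corners match NOT; the middle is a steep ramp, not the fuzzy 1 - o_j.
```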

We don't know if this is even learnable. If it is learnable, then we still don't know if it is better in experiments than a fuzzy logic rule that doesn't demand that these semi-linear methods learn corner shapes - that would be a nice experiment. But we have shown here, that implementing something like negation in neural networks is not impossible, for we have seen it.

AND

The case of NOT has taught us a system for engineering desired outputs: use a constant to shift the cut point to where you want it.

In the case of AND, the sums of \((0,0), (1,0), (0,1), (1,1)\) are \(0, 1, 1, 2\) respectively, so the cut point needs to be between 1 and 2, let's say 1.5 or 3/2. So let's use C's output weight to shift the weighted sum of the others down, so that a zero value for the weighted sum lies between the cases we want to separate.

So let's have four nodes, C, P and Q at layer \(l\), and k at layer \(l+1\), with weights \(w_{C,k}, w_{P,k}, w_{Q,k}\). Outputs are \(o_C=1\), \(o_P\in [0..1], o_Q\in [0..1]\). The weighted sum in each case is

\(s_k = o_C w_{C,k} + o_P w_{P,k} + o_Q w_{Q,k}\).

Now to implement AND, set \(w_{P,k} = w_{Q,k} = c, c\) large and positive (so \(\frac{c}{2}\) is also large and positive). Then set \(w_{C,k} = -\frac{3}{2}c\). Then

\(s_k = 1\times w_{C,k} + o_P\times c + o_Q\times c = -\frac{3c}{2} + o_P\times c + o_Q\times c\).
In case P and Q are 1, \(s_k = -\frac{3c}{2}+c+c = \frac{c}{2}\), which is still large and positive, and \(o_k = \theta(\frac{c}{2})\)→ 1.
In case either P or Q is 0, \(s_k = -\frac{3c}{2}+c = -\frac{c}{2}\), which is large and negative, so \( o_k = \theta(-\frac{c}{2})\)→0.
Just to confirm, if both P and Q are 0, \(s_k = -\frac{3c}{2}\), which is even more large and negative, so still \(o_k = \theta(-\frac{3c}{2})\)→ 0.

Bingo. We have implemented something like an AND by setting NN weights. It gets the corner cases right; it allows us to ramp the transition more sharply or more slowly at our chosen centerpoint of \(\frac{3}{2}\) according to the size of \(c\); and it even allows us to shift that centerpoint closer to (1,1) or closer to (1,0),(0,1) if we want, so it has some flexibility. On the other hand, this approach shoves the double-nought case double-far into the exponentially-close-to-zero corner, which isn't exactly the equal treatment expected from a true AND concept. In the log domain, this would be a disaster, but since we are using addition, as we do right away in going to the next layer of nodes, it may be okay. A loose zero versus a tight zero won't make much difference when scaled and summed at the next level; their contribution will again be close to zero. Is this a problem? Do neural networks ignore FALSE values? A weighted sum including a zero could just as well exclude it. Apparently they do!
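
Here is a quick numeric confirmation of the AND construction (NumPy; \(c=10\) is an arbitrary illustrative choice, not anything learned):

```python
import numpy as np

def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))

c = 10.0                                    # sharpness of the transition
w_Ck, w_Pk, w_Qk = -1.5 * c, c, c           # constant node shifts the sum by -3c/2

for o_P, o_Q in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    s_k = 1.0 * w_Ck + o_P * w_Pk + o_Q * w_Qk
    print((o_P, o_Q), round(float(logistic(s_k)), 4))
# -> (1,1) 0.9933, (1,0) 0.0067, (0,1) 0.0067, (0,0) 0.0 (the "double-far" corner)
```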

This is worth deeper thought. All the weights are importances of their inputs, even negative importances, but a zero output value from a previous layer says, please ignore me. Evidently Neural Networks implement a kind of one-sided logic, only paying attention to positive examples, and if a classifier node says this input is OUTSIDE my classification zone (by having an output \(o=0\)), then the rest of the neural network immediately goes about completely ignoring that important bit of information. Our NOT trick above, with extra constant nodes and weights, or a Fuzzy-Logic-Enhanced approach, also with an extra node and weight for each combination of inputs, might be quite the helpful enhancement for a more intelligent neural network.

What about OR?

OR

It's the same deal: spread out the weighted sums by using big weights, then shift them over so a threshold around zero separates them where you want to. For OR, we just apply less downshift via our weight \(w_{C,k}\). In the case of AND, we shifted the sum of \(o_P,o_Q\) down so that after being weighted by large weights, the sum is over zero only for the largest sum, where \(o_P, o_Q\) are both close to 1. So goes my informal thinking.

Now we want to shift the weighted sum of \(o_P,o_Q\) down only a little, so that it rises above zero whenever either \(o_P\) or \(o_Q\) approaches one:

Set \(w_{C,k} = -\frac{c}{2}\) retaining \(w_{P,k} = w_{Q,k} = c, c\) large and positive. Then

| Where | Weighted sum | Output |
| P=1,Q=1: | \(s_k = -\frac{c}{2} + c + c = \frac{3c}{2}\) | \(o_k =\theta(\frac{3c}{2})\)→1 |
| P=1,Q=0: | \(s_k = -\frac{c}{2} + c = \frac{c}{2}\) | \(o_k =\theta(\frac{c}{2})\)→1 |
| P=0,Q=1: | \(s_k = -\frac{c}{2} + c = \frac{c}{2}\) | \(o_k =\theta(\frac{c}{2})\)→1 |
| P=0,Q=0: | \(s_k = -\frac{c}{2}\) | \(o_k =\theta(-\frac{c}{2})\)→0 |

This captures our desired corner cases, at least, for a neural-net-style logical OR. Again, larger \(c\) makes for a sharper transition, and the \(\frac{1}{2}\) ratio between \(w_{C,k}\) and \(w_{P,k}, w_{Q,k}\) is actually a variable which can move the transition centerpoint closer to (0,0) or closer to (1,0),(0,1). And, mirroring the AND case, here the double-ones get shoved double-far into the exponentially-close-to-one corner, which, just as with the NN AND, isn't exactly the equal treatment expected from a true OR concept. But hopefully it makes no difference, since a weight and a sum farther along the network will make the difference between a tight one and a loose one very small indeed, when it is combined with other outputs in a weighted sum.
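
And the same check for OR, changing only the constant node's weight (again a sketch, with the same arbitrary \(c=10\)):

```python
import numpy as np

def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))

c = 10.0
w_Ck, w_Pk, w_Qk = -0.5 * c, c, c           # only the constant's weight changes

for o_P, o_Q in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    s_k = 1.0 * w_Ck + o_P * w_Pk + o_Q * w_Qk
    print((o_P, o_Q), round(float(logistic(s_k)), 4))
# -> (1,1) 1.0 (the extra-tight corner), (1,0) 0.9933, (0,1) 0.9933, (0,0) 0.0067
```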

IMPLIES

NOT, AND, OR, that's the foundation.

Now what about IMPLIES (and XOR)?

Here's a false path. A positive weight from a node at one level to a successor node at the next level, might be taken to mean something like IMPLIES.

There are the four cases:

| P | Q | P→Q |
| T | T | T |
| T | F | F |
| F | T | T |
| F | F | T |

None of these involves two nodes; they all involve three: the two inputs, and the one output which is the value of P→Q given the two input values for P and for Q. So successive node activation in neural nets is not a model of logical implication.

Think about it.

IF the predecessor node is substantially true, THEN the successor node will be substantially influenced toward being true by a large positive weight between them. So far so good. True leads to true, which sounds a lot like logical implication.

But IF the predecessor node is substantially false, so its output is close to zero, THEN the successor node will basically ignore the predecessor, since zero times whatever weight links them is still close to zero, so that predecessor will add little or nothing to the sum of inputs on which the successor fires if the sum is large enough. It doesn't drive the successor to false but to ignorance: θ(0) = 0.50, which means Don't Know, on the scale of [0..1] mapping to [No .. Yes].

A forward link is not a logical IMPLIES relationship.

So let's try again. Another way to ask the question is, Can plain Neural Networks encode subset relationships?

The logical IMPLIES operation is really an encoding for a subset relationship. For example, since dogs are a subset of mammals, then dog(x) → mammal(x); and dog(x) → mammal(x) is always true in a world in which dogs are (a subset of) mammals. So subset and → are essentially equivalent.

Logical IMPLIES wants the output to be true when the (first) input is false; and that makes more sense to me when I think of it as a subset relationship: P → Q means P is a subset of Q. A proposition P(x)→Q(x) is false only when the thing x is claimed to be in the subset, but it is NOT in the superset. Did you process that? I wrote it, and understood it when I composed the sentence, but by the time I wrote it down, my eyes were crossed. Re-reading after a nap, it makes sense again.

Yes. P IMPLIES Q can be implemented as Q OR NOT P, which in our concept requires a second layer. The first layer feeds forward from \(l\) to \(l+1\) to convert a value \(P\in [0..1]\) to \(\neg P \in [1..0]\) and the second layer from \(l+1\) to \(l+2\) to convert \(\neg P \in [1..0]\) along with \(Q\in [0..1]\) to something like \(\neg P \vee Q\) ⌶ \(P\)→\(Q\). (Notation here)

Similarly, P XOR Q can be implemented various ways. In Fuzzy Logic I like the average of |P-Q| and 1-|P+Q-1|, which is nice and linear, and also fast on integer CPUs, with a few adds, two sign-drops, and a bit-shift (to divide by 2). Could we do all that, and subtract 1/2, through the veil of intervening weights, in CENNs (Constant Enhanced Neural Networks, defined just below)? Maybe, I haven't worked it out.

But certainly, in constant-enhanced Neural Nets, we can use \((P\vee Q) \wedge \neg(P\wedge Q)\) to build that in, across 3 layers.
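
As a sketch of how these constant-enhanced gates might compose across layers (my own composition, reusing the NN-style NOT, AND, and OR weight patterns worked out above; nothing here is learned), IMPLIES takes two layers and XOR takes three:

```python
import numpy as np

def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))

c = 10.0   # arbitrary sharpness; each helper below is one constant-enhanced NN node

def nn_not(p):                   # w_C = c, w_P = -2c
    return logistic(c - 2.0 * c * p)

def nn_and(p, q):                # w_C = -3c/2, w_P = w_Q = c
    return logistic(-1.5 * c + c * p + c * q)

def nn_or(p, q):                 # w_C = -c/2, w_P = w_Q = c
    return logistic(-0.5 * c + c * p + c * q)

def nn_implies(p, q):            # two layers: NOT P, then OR with Q
    return nn_or(nn_not(p), q)

def nn_xor(p, q):                # three layers: (P OR Q) AND NOT (P AND Q)
    return nn_and(nn_or(p, q), nn_not(nn_and(p, q)))

for p, q in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    print((p, q), round(float(nn_implies(p, q)), 3), round(float(nn_xor(p, q)), 3))
# IMPLIES -> 0.993, 0.007, 1.0, 0.993    XOR -> 0.007, 0.993, 0.993, 0.007
```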

So to conclude: No, we don't need Fuzzy Logic Enhancements for there to exist the possibility of logical operations in a Neural Network.

Let's call them CENNs, Constant Enhanced Neural Networks. To a vanilla neural network, we add one constant-output node per layer, and a row in the weights matrix between that layer and the next, sparsely or fully populated with weights to every successor of another node or pair of nodes in the layer that might, with learning, be usefully subjected to a logical operation.

We don't know if this is even learnable. If it is learnable, then we still don't know if it is better in experiments than a fuzzy logic rule that doesn't demand that these semi-linear methods learn corner shapes - that would be a nice experiment. But we have shown here, that implementing something like negation, conjunction, disjunction, implications, and even the exclusive-or, within nearly-vanilla Neural Networks is not impossible, for we have seen it.

Open Questions

So, yes, questions remain, of learnability, effectiveness, and understandability or clarity.

Clarity

What happens to logic in the CENN approach outside a two-to-one mapping, that is, in a many-to-many mapping between layers? The logical combinations we have discussed consider only one or two nodes at a time, along with a single constant-output node at each layer, feeding forward to a single output node which, to be interpretable, can only take inputs from the constant node and those one or two. Don't the carefully matched values of \(c\) on both input nodes, or that special ratio of \(\frac{3}{2}\) between the constant's weight and the other inputs' weights, get mixed up and lost when other nodes with significant weights are also feeding into our downstream results node? I don't see how a node output could remain interpretable as any kind of truth value, discrete or fuzzy or otherwise, if it receives a large number of inputs.

In this dimension of interpretability, it is the Fuzzy-Logic Enhancements which win over CENNs. In FLENNs, the weight from a logical-combination node to anything in the succeeding layer, if it develops a large value, directly asserts the fuzzy-logical meaning that the combination node implements, whether that's an AND, OR, NOT, IMPLIES, or XOR. We labelled the node when we made it, and we set up its deterministic, non-learnable formula to combine its specific one or two inputs. So we know. What the system learns is whether to put any weight on the output of that logical combination, whether it is usefully meaningful as the layers feed forward to reduce the total errors. It doesn't construct logicality by learning weights which combine in special ways to make logical inferences, as in the CENN approach. No, FLENNs detect the value of logicality, by putting a substantial weight onto a designer-defined logical combination.

Learnability

FLENNs seem eminently learnable. The learning task is simply to detect whether some logical combination is useful in reducing errors, and both gradient descent and SLA (successive linear approximation) should be able to do that nicely.

In contrast, how many rounds must CENN training run for, so that useful logical combinations can be trained up? It seems a lot more adjusting would be needed, to find values for \(c\), to balance them between the inputs P and Q, to somehow get the Constant node's output to be \(c\) times the right scale factor for AND or OR or NOT. Weight space is large, and these are small subsets that must be found, and it is not clear that getting closer to useful values will produce much in the way of error reduction or gradient, especially considering possible countervailing effects in the data. We may only hope that gradient descent or successive linear approximation will encounter them in any finite number of training generations.

Experiments may be expected to confirm or dash such hopes.

Effectiveness

Indeed the value of FLENNs needs experimental validation. While it is interesting and exciting that Neural Networks might themselves be better understood if they are enhanced to operate, and become interpretable, as fuzzy-logical classifiers and reasoning systems in either the CENN or FLENN senses, the rubber needs to meet the road in experimental work to prove it out and see if this approach has any benefits as to performance. Costs in memory and in CPU operation counts need to be measured and found reasonable. Predictive capacity needs to be compared with alternative methods: FLENN vs CENN vs others. Apples-to-apples comparisons based on numbers of nodes, numbers of weights, and numbers of calculations for feedforward application and for backpropagation training, per generation and per decrement in errors, for different tasks, might tell us whether FLENN enhancements are unreasonably costly for insufficient benefit. Experimental variables to be explored include the number of node pairs to be enhanced, and the number and mixture of types of logical combinations added, over which to study the impacts on performance on specific machine-learning problems. Non-convex problems and rich conceptual domains such as language would seem to be likely domains of success, where FLENNs might help.

Near Zero

When neural network training can adjust weights in extremely tiny increments, it becomes possible for the approximated values of a given weight to carefully and gradually approach, with precision, a target value that happens to be close to zero.

Weights very close to zero could have a tremendous logical significance in machine learning outside NNs. If an ML method were to combine information across a graph by using multiplication, which is log-domain addition, then tiny fractions, which map in the log domain to large negative log values, have outsize effects. In Neural Networks, however, a weight close to zero, multiplied with even the largest of input values (≤1), reduces the influence of the predecessor's output to nothing. The weighted output has no influence on the sum of the next layer's inputs, because when you add zero to a sum, you don't change the sum. It is as if this node wasn't even there, for that successor. Learning a near-zero weight means learning to ignore an input.

It's not a NOT OR AND IMPLIES or XOR, it's an IGNORE.

Fractional significances

Let me say it again. What is the effect in NNs of weights that are small fractions, close to zero?

Inverses are not importances. They don't allow us to invert our calculations, as on a log scale they might be able to do.

We reason from large weights to the importance of the predecessor's classificatory information in the positive detection of this node's class. We assume that a linear combination of predecessors will get us the right result. This means a lollapalooza effect, where many inputs pointing in the same direction are added together to get a super-strong inference of that class. It also means fungibility, which is the effect that some of This can be understood as counting the same as some of That Other, in the ratio of their weights, so that the sum can be the same whether it got its input from here or there. It doesn't matter which one the news came from; they count the same. Further, we allow large negative weights to cancel out the positive information gleaned from other nodes, but again by sum, not by product. (If by product, then small fractions would have the effect of large negatives in this additive domain.) Here even negatives are fungible: they apply to the same inferred conclusion, which is simply the resulting sum at the end of the summation of weighted inputs, and less of some positives is taken to count the same as more of this negative. That's simply how \(\large +\) works, or here more specifically, \(s_j=\sum_i o_i w_{i,j}\).

On the other hand, when we reason about weights that are inverses of large numbers, that is, close-to-zero weights, it says that this is not important to that. When This has an output that is close to one, and the weight from This node to That node multiplies by a weight that is close to zero, it results in a lack of influence on the successor. Plenty of zeroes all through the sum will leave \(s_j\) still zero-ish, and θ(\(s_j\)) falls in the ignorant, know-nothing middle at 1/2.

So weights close to zero, fractions, inverses of large weights, should be thought of as unimportances, as ignoreabilities, not as positive information that something is not there, nor as the certainty even that the following is unknown. Because its zero-ness gets immediately lost if any other feeder of the downstream process sends forth a weighted output greater than zero. Zeroes, zero weights, zero outputs, are simply ignored. Nothing detects their presence, without specialty enhancements, as we have discussed.

If weights get close to zero, maybe at some point in training they should be floored to zero, since they are that unimportant; and then the system can be simpler and faster and hardly any less effective. We let them hang around only because they might pick up some utility later on in learning, and grow to be important again. When done learning, we could probably zero them out.

Conclusion

Messy and emergent categories come out of machine learning systems.

In my 1987 unpublished independent study of hidden Markov models, I re-estimated model parameters of randomly-initialized, N-state, fully-connected HMMs to best fit a training data set of raw text data. With N=2, the probabilities of the two states for each letter amazingly picked out the vowels as high-probability on one state, and the consonants on the other, as a generalization. The categories were not perfect; they seemed to do a best-fit job of making the most of the statistical patterns in the input data, and it was rather along the lines of a miracle that something very close to natural phonotactic categories emerged automatically from such a general statistical leverage optimizer. With N=3, punctuation dominated one of the states. With N=4 and N=5 there were syllable-position-sensitive categorizations, like /s/ is initial in syllable-onsets, and final in syllable codas, etc. At higher values of N, it didn't seem to make a lot of sense what the emerging categories might be. The way to think about it was that each state optimized for simply maximizing the joint probability of the whole observation sequence, irrespective of any preferred linguistic categories. The data rules; preconceived linguistic categories be damned.

This experience translates quite analogously to the neural network case. Each node can be considered a fuzzy-logic category detector, but the meaning of the emergent categories which the training process gradually creates is anyone's guess. A not-exactly-probabilistic combination of multiple known categories, or a not-exactly-probabilistic subdivision of a single known category according to how it best fits the data, are just a couple of the ways that backpropagation might generate mysterious or incomprehensible, yet useful, emergent categories, along with their own special logic, using linear transformation under non-linear thresholding, that can combine and manipulate them arbitrarily.

However, enhanced by Fuzzy-Logic nodes representing Fuzzy-Logic combinations of pairs of nodes at a given layer, computational neural networks can be made capable of learning interpretable and hopefully effective and useful, economically represented and efficiently-calculated, Fuzzy-Logical classifications and transformations at key places in their information flow from inputs to estimated outputs. A weights editor will be able to tell you what logic is being done in such an enhanced system, and you may then hope to see inside the mind of its artificial and effective intelligence. Then we can discover the logical truth, the insight-bearing informational structure of whatever problems we are studying.

In conclusion, Fuzzy Logic can enhance neural networks by providing more interpretability and easier training of specifically logical combinations. But Neural Networks also enhance Fuzzy Logic, by providing machine learning methods to make Fuzzy Logic systems learnable from data, including big data. So although FL can be considered an enhancement to NNs, it goes the other way too: NNs offer straightforward training to the FL concept. For example, perhaps a single training example of a robot arm dragged manually through its working trajectory can be captured and converted into an FL control system. Or we may track and collect mechanical skills of animals or humans, and optimize FLENNs to train them to implement that skill set. One or a few examples may be enough, with judiciously chosen mappings to time and space and control systems. Multi-layer connections may be used: one for the control pattern, another to self-contextualize, to learn one's body, and to learn one's environment. A perception/action loop must find the threads of the learned concepts in the percepts, and generate coordinated, modelled response concepts in its provided body.


Copyright © 2023 Thomas C. Veatch. All rights reserved.
Created: September 12, 2023