# Bayes

the math of learning

This is useful and wondrous. You might say "Wow!" when you get it.

Here's how to learn.

No one starts with no idea; even the tabula rasa of a newborn baby has inbuilt senses, which constitute limited measurement spaces and therefore carry implicit expectations about what can be known.

So we have some kind of idea, which might actually be equivalent to some small number N > 0 of valid observations of the true process, the thing we are trying to learn about.

Then we improve our idea. Rinse, repeat.

It's called Bayes.

To understand Bayes we need some very simple stuff.

# Independence

Next July 4, ask yourself, what is Independence?

 "X is independent of Y" means P(X,Y) = P(X)*P(Y)

We have this idea in America, that what the government wants me to do, and what I do, are Independent. Because I am Independent. That's what we like to think. But let's simplify to the simplest possible case of what could happen: flip two fair coins, and call the outcomes X and Y.

That's where the definition of Independent comes in. Get it? You multiply $$\frac{1}{2} * \frac{1}{2}$$ to get the probability of, say, two heads, because the two flips are independent. If they weren't independent, then $$\frac{1}{2} * \frac{1}{2}$$ would be an inaccurate probability for a given joint outcome.
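To see the definition in action, here is a quick simulation sketch in Python (the coin-flip setup and variable names are my own, not from the text): flip two fair coins many times and check that the joint frequency matches the product of the marginals.

```python
import random

random.seed(0)
N = 100_000

# Two fair coin flips per trial; "heads" = 1.
flips = [(random.randint(0, 1), random.randint(0, 1)) for _ in range(N)]

p_x = sum(x for x, _ in flips) / N              # P(X = heads)
p_y = sum(y for _, y in flips) / N              # P(Y = heads)
p_xy = sum(1 for x, y in flips if x and y) / N  # P(X = heads AND Y = heads)

print(p_x * p_y)  # ~0.25
print(p_xy)       # ~0.25 -- they agree, because the flips are independent
```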

The degree of difference between P(X, Y) and P(X) * P(Y), which we'll call the dependency, is on average the amount of information X gives you about Y. If there is some correlation, it gives you some information. The way to read this is:

(1) Dependency(X,Y) ≜ $$\frac{P(X, Y)}{P(X) * P(Y)}$$

and the degree to which that ratio differs from 1, on average, is on one hand the Correlation between the two, and on the other hand the Information that one gives you about the other.

I'm very sensitive to whenever the ratio $$\frac{P(X, Y)}{P(X) * P(Y)}$$ appears in my focal vision, because in the expected case where everything is independent it's just $$1$$, a nice number that you can often simplify away. It's a simple idea: it's what one tells you about the other. So maybe it'll hang out in your brain too, among the things you are primed to detect: $$\frac{P(X, Y)}{P(X) * P(Y)}$$.
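The Dependency ratio in (1) can be estimated directly from raw counts. A minimal sketch (the 80%-copying scenario is an invented example, not from the text):

```python
import random

random.seed(1)
N = 100_000

def dependency(pairs, x, y):
    """Estimate Dependency(X=x, Y=y) = P(X=x, Y=y) / (P(X=x) * P(Y=y))."""
    n = len(pairs)
    p_x = sum(1 for a, _ in pairs if a == x) / n
    p_y = sum(1 for _, b in pairs if b == y) / n
    p_xy = sum(1 for a, b in pairs if a == x and b == y) / n
    return p_xy / (p_x * p_y)

# Independent case: two unrelated fair coins -> ratio near 1.
indep = [(random.randint(0, 1), random.randint(0, 1)) for _ in range(N)]

# Dependent case: Y copies X 80% of the time -> ratio well away from 1.
dep = []
for _ in range(N):
    x = random.randint(0, 1)
    y = x if random.random() < 0.8 else 1 - x
    dep.append((x, y))

print(dependency(indep, 1, 1))  # ~1.0: Y tells you nothing about X
print(dependency(dep, 1, 1))    # ~1.6: observing Y=1 raises the odds of X=1
```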

# Conditional Probability

Let's do conditional probability, a.k.a. P(X|Y). Just use the Venn diagram.

Think of the area inside the X circle as a whole as P(X).
Think of the area inside the Y circle as a whole as P(Y).
Think of the overlapping part of the two circles as P(X, Y), the probability of BOTH X AND Y.

(You may allow statisticians to think of X and Y as random variables that take various values in a given experiment, such as $$x$$ or $$y$$ respectively, so that these expressions are shorthand for $$P(X=x)$$, $$P(Y=y)$$, etc.)
Now

(2) $$P(X|Y) ≜ \frac{P(X, Y)}{P(Y)}$$
That's the definition of the probability of X given Y. So also
(3) $$P(Y|X) ≜ \frac{P(X, Y)}{P(X)}$$
P(X|Y) is the fraction of P(Y) that is X AND Y, P(X, Y). That's the definition of conditional probability, the probability of something given something else.
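Definition (2) is easy to check against frequencies. A small sketch, using a made-up die example where X is "the roll is even" and Y is "the roll is greater than 3":

```python
import random

random.seed(2)
N = 100_000

# Roll a fair die; X = "roll is even", Y = "roll is > 3".
rolls = [random.randint(1, 6) for _ in range(N)]

p_y = sum(1 for r in rolls if r > 3) / N                  # P(Y)
p_xy = sum(1 for r in rolls if r % 2 == 0 and r > 3) / N  # P(X, Y)

# Definition (2): P(X|Y) = P(X, Y) / P(Y)
p_x_given_y = p_xy / p_y
print(p_x_given_y)  # ~2/3: of {4, 5, 6}, the even rolls are {4, 6}
```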

# Bayes

Okay now we have it in our fingertips. Just do the math:

(4) $$P(X|Y) * P(Y) = P(X, Y)$$ rearranging (2)

(5) $$P(Y|X) * P(X) = P(X, Y)$$ rearranging (3)

(6) $$P(X, Y) = P(X|Y) * P(Y) = P(Y|X) * P(X)$$ putting (4) and (5) together

(7) $$P(X|Y) = \frac{P(Y|X)}{P(Y)} * P(X)$$ rearranging (6): Bayes

(8) $$P(X|Y) = \frac{P(X, Y)}{P(X)} * \frac{1}{P(Y)} * P(X)$$ substituting (3) into (7)

(9) $$P(X|Y) = \text{Dependency}(X, Y) * P(X)$$ substituting (1) into (8): Veatch
So all this is just rearranging the furniture in (1), (2), and (3), which are all just definitions. It's not only true, it doesn't actually even say anything, so it has the general quality that it can't even be false.
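Since (2) through (9) are all rearrangements of the same definitions, they have to agree numerically. Here is a sketch that checks forms (2), (7), and (9) against an arbitrary toy joint distribution (the numbers are invented for illustration):

```python
# A tiny joint distribution over X in {0,1} and Y in {0,1};
# the four entries are arbitrary but sum to 1.
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def p_x(x):
    return sum(v for (a, _), v in joint.items() if a == x)

def p_y(y):
    return sum(v for (_, b), v in joint.items() if b == y)

for x in (0, 1):
    for y in (0, 1):
        by_def = joint[(x, y)] / p_y(y)                        # (2): P(X|Y)
        p_y_given_x = joint[(x, y)] / p_x(x)                   # (3): P(Y|X)
        bayes = p_y_given_x * p_x(x) / p_y(y)                  # (7): Bayes
        veatch = joint[(x, y)] / (p_x(x) * p_y(y)) * p_x(x)    # (9): Dependency * P(X)
        assert abs(by_def - bayes) < 1e-12
        assert abs(by_def - veatch) < 1e-12
print("all three forms agree")
```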

Nice, huh?

Okay, what have we got here? We have P(X), which is what you know or guess at the beginning. It's our current, temporary probability distribution over some Random Variable or observable outcome X: what little we already know about X.

Okay, now we're going to learn something after watching for a while. We watch our target, X, which we are trying to learn about, and also some other stuff Y that might or might not predict what X is going to do; if it does, and we see how it does, then we might learn something. So we make some observations, and now the question is this:

What is the relationship between our previous initial model of X, namely P(X), and our new updated model of X, the one we move to after we are given our observations Y, namely P(X|Y)?

The relationship is on one hand the hard-to-interpret (7) above (Bayes), which is too much for my small brain, or alternatively (9) above (Veatch), which is immediately interpretable to anyone who knows what "Independence" and "Dependency" mean, which by now I hope includes you.

So: take your current model P(X), multiply by the dependency $$\frac{P(X, Y)}{P(X)*P(Y)}$$ between X and Y, and that's your improved model P(X|Y) after making your observations Y.

In case X and Y are completely independent and knowing Y doesn't tell you anything to narrow down X, then Dependency = 1, it cancels out, and you learn nothing, so P(X|Y) = P(X). But if you get some traction on X with something observable Y that does actually predict X at least a little bit, then you get some learning, and your new model P(X|Y) is the one where you know about Y now and how it predicts the outcome of X.

In short, every time you contact actual reality, multiply the newly observed Dependency into your model, and if you can now predict better because you've found actual dependencies, then your model will be getting better and better.
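That loop can be sketched as code. The following toy example is entirely my own construction, not from the text: we learn an unknown coin bias by repeatedly multiplying the likelihood of each new flip into a gridded model and renormalizing, which is the same update written in its posterior-proportional-to-likelihood-times-prior form.

```python
import random

random.seed(3)
true_bias = 0.7  # the real P(heads), unknown to the learner

# Model: a uniform prior over a grid of candidate biases (a toy setup).
grid = [i / 100 for i in range(1, 100)]
model = {b: 1 / len(grid) for b in grid}  # P(X): start ignorant

for _ in range(200):
    flip = 1 if random.random() < true_bias else 0
    # Update: P(X|Y) is proportional to P(Y|X) * P(X), one flip at a time.
    for b in grid:
        model[b] *= b if flip else (1 - b)
    total = sum(model.values())
    model = {b: p / total for b, p in model.items()}  # renormalize

best = max(model, key=model.get)
print(best)  # the model concentrates near 0.7 after enough flips
```

Each pass through the loop is one "contact with actual reality": the observed flip reweights every candidate explanation by how well it predicted that flip.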

Rinse, repeat.

That's Bayes. That's how we learn.
