# Bayes

the math of learning

This is useful and wondrous. You might say "Wow!" when you get it.

Here's how to learn.

No one starts with no idea; even the tabula rasa of a newborn baby has inbuilt senses, which constitute limited measurement spaces and therefore carry implicit expectations about what can be known.

So we have some kind of idea, which might actually be equivalent to some small number N > 0 of valid observations of the true process, the thing we are trying to learn about.

Then we improve our idea. Rinse, repeat.

It's called Bayes.

To understand Bayes we need some very simple stuff.

# Independence

Next July 4, ask yourself, what is Independence?

 "X is independent of Y" means P(X,Y) = P(X)*P(Y)

We have this idea in America, that what the government wants me to do, and what I do, are Independent. Because I am Independent. That's what we like to think. But let's simplify to the simplest possible case of what could happen: flip two fair coins, and call the outcomes X and Y.

That's where the definition of Independent comes in. Get it? You multiply $$\frac{1}{2} * \frac{1}{2}$$ to get the probability of, say, two heads, because the two flips are independent. If they weren't independent, then $$\frac{1}{2} * \frac{1}{2}$$ would be an inaccurate probability for a given joint outcome.
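To see the definition in action, here is a quick simulation sketch in Python (the coin-flip setup and variable names are my own, not from the text): flip two fair coins many times and check that the joint frequency matches the product of the marginals.

```python
import random

random.seed(0)
N = 100_000

# Two fair coin flips per trial; "heads" = 1.
flips = [(random.randint(0, 1), random.randint(0, 1)) for _ in range(N)]

p_x = sum(x for x, _ in flips) / N              # P(X = heads)
p_y = sum(y for _, y in flips) / N              # P(Y = heads)
p_xy = sum(1 for x, y in flips if x and y) / N  # P(X = heads AND Y = heads)

print(p_x * p_y)  # ~0.25
print(p_xy)       # ~0.25 -- they agree, because the flips are independent
```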

The degree of difference between P(X, Y) and P(X) * P(Y), which we'll call the dependency, is on average the amount of information X gives you about Y. If there is some correlation, it gives you some information. The way to read this is:

(1) Dependency(X,Y) ≜ $$\frac{P(X, Y)}{P(X) * P(Y)}$$

and the degree to which that ratio differs from 1, on average, is on one hand the Correlation between the two, and on the other hand the Information that one gives you about the other.

I'm very sensitive to whenever the ratio $$\frac{P(X, Y)}{P(X) * P(Y)}$$ appears in my focal vision, because in the expected case where everything is independent it's just $$1$$, a nice number that you can often simplify away. It's a simple idea: it's what one tells you about the other. So maybe it'll hang out in your brain too, among the things you are primed to detect: $$\frac{P(X, Y)}{P(X) * P(Y)}$$.
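The Dependency ratio in (1) can be estimated directly from raw counts. A minimal sketch (the 80%-copying scenario is an invented example, not from the text):

```python
import random

random.seed(1)
N = 100_000

def dependency(pairs, x, y):
    """Estimate Dependency(X=x, Y=y) = P(X=x, Y=y) / (P(X=x) * P(Y=y))."""
    n = len(pairs)
    p_x = sum(1 for a, _ in pairs if a == x) / n
    p_y = sum(1 for _, b in pairs if b == y) / n
    p_xy = sum(1 for a, b in pairs if a == x and b == y) / n
    return p_xy / (p_x * p_y)

# Independent case: two unrelated fair coins -> ratio near 1.
indep = [(random.randint(0, 1), random.randint(0, 1)) for _ in range(N)]

# Dependent case: Y copies X 80% of the time -> ratio well away from 1.
dep = []
for _ in range(N):
    x = random.randint(0, 1)
    y = x if random.random() < 0.8 else 1 - x
    dep.append((x, y))

print(dependency(indep, 1, 1))  # ~1.0: Y tells you nothing about X
print(dependency(dep, 1, 1))    # ~1.6: observing Y=1 raises the odds of X=1
```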

# Conditional Probability

Let's do conditional probability, a.k.a. P(X|Y). Just use the Venn diagram.

Think of the area inside the X circle as a whole as P(X).
Think of the area inside the Y circle as a whole as P(Y).
Think of the overlapping part of the two circles as P(X, Y), the probability of BOTH X AND Y.

(You may allow statisticians to think of X and Y as random variables that take various values in a given experiment, such as $$x$$ or $$y$$ respectively, so that these expressions are shorthand for $$P(X=x)$$, $$P(Y=y)$$, etc.)
Now

(2) $$P(X|Y) ≜ \frac{P(X, Y)}{P(Y)}$$
That's the definition of the probability of X given Y. So also
(3) $$P(Y|X) ≜ \frac{P(X, Y)}{P(X)}$$
P(X|Y) is the fraction of P(Y) that is X AND Y, P(X, Y). That's the definition of conditional probability, the probability of something given something else.
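Definition (2) is easy to check against frequencies. A small sketch, using a made-up die example where X is "the roll is even" and Y is "the roll is greater than 3":

```python
import random

random.seed(2)
N = 100_000

# Roll a fair die; X = "roll is even", Y = "roll is > 3".
rolls = [random.randint(1, 6) for _ in range(N)]

p_y = sum(1 for r in rolls if r > 3) / N                  # P(Y)
p_xy = sum(1 for r in rolls if r % 2 == 0 and r > 3) / N  # P(X, Y)

# Definition (2): P(X|Y) = P(X, Y) / P(Y)
p_x_given_y = p_xy / p_y
print(p_x_given_y)  # ~2/3: of {4, 5, 6}, the even rolls are {4, 6}
```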

# Bayes

Okay now we have it in our fingertips. Just do the math:

(4) $$P(X|Y) * P(Y) = P(X, Y)$$ rearranging (2)

(5) $$P(Y|X) * P(X) = P(X, Y)$$ rearranging (3)

(6) $$P(X, Y) = P(X|Y) * P(Y) = P(Y|X) * P(X)$$ putting (4) and (5) together

(7) $$P(X|Y) = \frac{P(Y|X)}{P(Y)} * P(X)$$ rearranging (6): Bayes

(8) $$P(X|Y) = \frac{P(X, Y)}{P(X)} * \frac{1}{P(Y)} * P(X)$$ substituting (3) into (7)

(9) $$P(X|Y) = \text{Dependency}(X, Y) * P(X)$$ substituting (1) into (8): Veatch
So all this is just rearranging the furniture in (1), (2), and (3), which are all just definitions. It's not only true, it doesn't actually even say anything, so it has the general quality that it can't even be false.
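Since (2) through (9) are all rearrangements of the same definitions, they have to agree numerically. Here is a sketch that checks forms (2), (7), and (9) against an arbitrary toy joint distribution (the numbers are invented for illustration):

```python
# A tiny joint distribution over X in {0,1} and Y in {0,1};
# the four entries are arbitrary but sum to 1.
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def p_x(x):
    return sum(v for (a, _), v in joint.items() if a == x)

def p_y(y):
    return sum(v for (_, b), v in joint.items() if b == y)

for x in (0, 1):
    for y in (0, 1):
        by_def = joint[(x, y)] / p_y(y)                        # (2): P(X|Y)
        p_y_given_x = joint[(x, y)] / p_x(x)                   # (3): P(Y|X)
        bayes = p_y_given_x * p_x(x) / p_y(y)                  # (7): Bayes
        veatch = joint[(x, y)] / (p_x(x) * p_y(y)) * p_x(x)    # (9): Dependency * P(X)
        assert abs(by_def - bayes) < 1e-12
        assert abs(by_def - veatch) < 1e-12
print("all three forms agree")
```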

Nice, huh?

Okay, what have we got here? We have P(X), which is what you know or guess at the beginning. It's our current, temporary probability distribution over some Random Variable or observable outcome X: what little we already know about X.

Okay, now we're going to learn something after watching for a while. We watch our target, X, which we are trying to learn about, and also some other stuff Y that might or might not predict what X is going to do; if it does, and we see how it does, then we might learn something. So we make some observations, and now the question is this:

What is the relationship between our previous initial model of X, namely P(X), and our new updated model of X, the one we move to after we are given our observations Y, namely P(X|Y)?

The relationship is on one hand the hard-to-interpret (7) above (Bayes), which is too much for my small brain, or alternatively (9) above (Veatch), which is immediately interpretable to anyone who knows what "Independence" and "Dependency" mean, which by now I hope includes you.

So: take your current model P(X), multiply by the dependency $$\frac{P(X, Y)}{P(X)*P(Y)}$$ between X and Y, and that's your improved model P(X|Y) after making your observations Y.

In case X and Y are completely independent and knowing Y doesn't tell you anything to narrow down X, then Dependency = 1, it cancels out, and you learn nothing, so P(X|Y) = P(X). But if you get some traction on X with something observable Y that does actually predict X at least a little bit, then you get some learning, and your new model P(X|Y) is the one where you know about Y now and how it predicts the outcome of X.

In short, every time you contact actual reality, multiply the newly observed Dependency into your model, and if you can now predict better because you've found actual dependencies, then your model will be getting better and better.
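That loop can be sketched as code. The following toy example is entirely my own construction, not from the text: we learn an unknown coin bias by repeatedly multiplying the likelihood of each new flip into a gridded model and renormalizing, which is the same update written in its posterior-proportional-to-likelihood-times-prior form.

```python
import random

random.seed(3)
true_bias = 0.7  # the real P(heads), unknown to the learner

# Model: a uniform prior over a grid of candidate biases (a toy setup).
grid = [i / 100 for i in range(1, 100)]
model = {b: 1 / len(grid) for b in grid}  # P(X): start ignorant

for _ in range(200):
    flip = 1 if random.random() < true_bias else 0
    # Update: P(X|Y) is proportional to P(Y|X) * P(X), one flip at a time.
    for b in grid:
        model[b] *= b if flip else (1 - b)
    total = sum(model.values())
    model = {b: p / total for b, p in model.items()}  # renormalize

best = max(model, key=model.get)
print(best)  # the model concentrates near 0.7 after enough flips
```

Each pass through the loop is one "contact with actual reality": the observed flip reweights every candidate explanation by how well it predicted that flip.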

Rinse, repeat.

That's Bayes. That's how we learn.
