A data collector app with Fisher's Exact Test

For a scientist in the field

Science can get complicated, but the base case is simply saying stuff about stuff: predication. Say, roses are red, for example, or quite generally Xs are Y: the predicate or property Y, like being red, applies to the category or group X, like roses. Predication is simple and general.
The structure and properties of predication apply, to at least a first approximation, at three levels: (1) human language sentence structure (syntax), where Subject and Predicate are the basic structure in many languages; (2) meaningful assertions about reality (semantics), where predication is the basic form of statements in logic, as found in the predicate calculus; and (3) interpersonally received messaging (pragmatics), with its Topic and Comment structure.
Almost anything you might say could be like this. Any property, attribute, or assertion about anything or about any category could be seen as having this structure. A scientist wants to know if it's really true and also to be able to persuade others it's true.

Here we translate simple predication into the statistical proof language of scientific argument: consistently observe the facts of reality, keep track of the counts of what you see, and calculate whether the observed counts are even slightly compatible with the opposite of your assertion. If so, your assertion is weak, flimsy, or false; if not, you win, or at least your idea survives. When the counts would be very, very improbable under the opposite, or null, hypothesis, then your assertion, or something like it, becomes far more credible and also shareable, or at least is a good basis for an argument. Simple.

What was hard for me to wrap my head around is that this isn't just counting out a dozen, say, red roses. Your counts need to show a difference, a real difference. Because maybe you wear rose-colored glasses and everything is red to you; no one should have to believe someone wearing rose-colored glasses. So what kind of difference, and what makes it a real difference?

In 1935 a curious and thoughtful Englishman named Fisher wrote about this in The Design of Experiments, in a chapter later reprinted as "Mathematics of a Lady Tasting Tea", to explain why, and how to calculate whether, some counts show a real difference.

For Fisher, what is a difference? It's not two numbers, it's four. A difference shows a real difference if it is two differences and THEY are different. Suppose you have ten rookies and ten veterans. Is that a difference? No, not yet! You may say they're different, but your words are meaningless unless you can (statistically) prove it. Maybe, for example, the veterans actually never learned anything because the coaching is useless. You have to show they are different by observing a difference. So you put them to a task, say, and mostly the veterans can do it, and mostly the rookies can't. That could be a difference. If they are all about the same, then they aren't actually different. So the setup has to be

Rookies that Can, Rookies that Can't,
Veterans that Can, Veterans that Can't.

Four numbers. Like this.

|  | Rookies | Veterans | Side Total |
|---|---|---|---|
| Can | A | B | A+B |
| Can't | C | D | C+D |
| Bottom Total | A+C | B+D | The Total: A+B+C+D |

Fisher gave us the math; thank you, Ronald! He uses those four numbers, A, B, C, and D, to calculate the probability that counts like these would turn up by chance alone. Even if you assume there's no actual difference between the groups, the counts might still look like that by random chance. Random means you use the row frequency to put things into rows and the column frequency to put things into columns, and you don't say, oh, for the special combination of this row and this column we're going to throw in a special bump in the frequency because that column and that row are special. No, not that.
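To make "no special bumps" concrete, here is a minimal Python sketch of the counts you would expect in each cell if only the row frequency and the column frequency mattered. The function name `expected_counts` and the rookie/veteran numbers are made up for illustration; they are not taken from the app.

```python
def expected_counts(a, b, c, d):
    """Expected cell counts when rows and columns are independent.

    Each cell gets (row total * column total) / grand total,
    i.e. row frequency times column frequency, with no special bump.
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    return [[r * k / n for k in col_totals] for r in row_totals]


# Hypothetical counts: 2 of 10 rookies can, 8 of 10 veterans can.
# Layout follows the table: A, B on the "Can" row; C, D on the "Can't" row.
observed = (2, 8, 8, 2)
print(expected_counts(*observed))  # [[5.0, 5.0], [5.0, 5.0]]
```

With these invented numbers the observed table (2, 8, 8, 2) sits far from the even 5-5-5-5 split that pure row and column frequencies would predict, which is the kind of gap the test puts a probability on.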

"No special bumps" is called the Null Hypothesis, and even though it's Null it still tells us enough that we can use it to actually calculate the probability of a random event like your actual counts based on the Null Hypothesis being true. Here the Null Hypothesis is that both rows have the same proportions and any difference is random. And calculating that random probability is called the Fisher Exact Test.

Fisher used it for the Lady Tasting Tea. Sit back, it's story time!

It was 1919 or perhaps 1922. He was nice: he made some tea for his co-worker's future fiancée, who was working a lot with algae, so maybe she was sensitized by all the slime around. She refused it. She said, You put the milk into the tea, I prefer the opposite. He scoffed, of course there's no difference which order you pour in, hahaha.

And her aspiring fiancé is there having tea too. He obviously wanted to be on her side, because later he did become her fiancé, and later still her husband, but he was probably also working with or for Fisher, so he had to take his side too. Being caught in the middle, he said, "Let's test her." Apparently they did the experiment, and apparently she passed, amazing everyone. (Did you ever microwave tea with milk and notice a difference? Exactly.)

Well, at least by 1935 when Fisher published it, this made a fine story about the advancement of science by having a tea party, and about men opening their minds enough to possibly listen to women, at least if she is your fiancée, and about Fisher, who turned the humiliation of his arrogance into a triumph for the world. Muriel Bristol, the magician of the story, was promptly forgotten, except for a type of green algae named for her, C. Muriella, but who knows that? Except you can read about it here and here.

There were two groups, tea first then milk, and milk first then tea. And two labels: the lady called it one way, and the lady called it the other way. The groups are about the tea itself, the labels are about the lady herself, thus: "Tea Lady". With 4 cups in each group, and all presented randomly on the tray, the probability she could have got them all right by random choice is: 4/8 that she got the first choice right (right?), times 3/7 that she got the second choice right, times 2/6 that she got the third choice right, times 1/5 that she got the last choice right. (4*3*2*1)/(8*7*6*5) = 24/1680 = 1/70.
The same product in terms of the four counts, when every call is right (B = C = 0):
```
   A         A-1         A-2               2         1
------- * --------- * --------- * ... * ------- * -------
A+B+C+D   A+B+C+D-1   A+B+C+D-2         B+C+D+2   B+C+D+1
```
If Muriel was just guessing, the probability of getting all her guesses right in such an experiment is still 1/70. Less than 5%, less than 2%, more than 1%. Pretty improbable. The fiancé later said she got enough right to prove her point. If it was 8 cups split 50/50, that means she got them all right, which we can calculate to be a low-probability random event, which means she very likely wasn't guessing randomly.
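The 1/70 figure can be checked with a few lines of Python. This is only a sketch of the arithmetic above, assuming 8 cups split 4 and 4; the function name `all_right_by_guessing` is invented for illustration.

```python
from fractions import Fraction
from math import comb


def all_right_by_guessing(cups_per_group):
    """Chance of picking every cup of one group correctly by pure guessing."""
    total = 2 * cups_per_group
    p = Fraction(1)
    for k in range(cups_per_group):
        # k cups already picked correctly: (cups_per_group - k) right ones
        # remain among (total - k) cups still on the tray.
        p *= Fraction(cups_per_group - k, total - k)
    return p


p = all_right_by_guessing(4)
print(p)                             # 1/70
print(p == Fraction(1, comb(8, 4)))  # True: one winning pick out of "8 choose 4"
```

The second check shows the shortcut: there are 70 ways to choose 4 cups out of 8, and only one of them is entirely right.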

The Fisher Exact Test is calculated pretty much like this. It's a little more complicated if she got some wrong, because those could have been in any order, so you have to count those orders in and divide them out again to get the answer. I won't explain how that goes, but it's called combinatorics and it's in the first part of probability class, so you can probably read about it and figure it out by yourself if you like.
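For anyone who would rather see the counting than read about it, here is a hedged Python sketch of the standard hypergeometric formula for one exact table with the row and column totals held fixed. The name `table_probability` is made up here, and the 4-and-4 cup layout is the same assumption as in the story.

```python
from fractions import Fraction
from math import comb


def table_probability(a, b, c, d):
    """Probability of this exact 2x2 table under the null hypothesis,
    given its row totals (a+b, c+d) and column totals (a+c, b+d)."""
    n = a + b + c + d
    # Ways to fill the first column: a from the first row, c from the second,
    # divided by all ways to choose the first column from everything.
    return Fraction(comb(a + b, a) * comb(c + d, c), comb(n, a + c))


# Eight cups, four of each kind: k is how many milk-first cups she calls right.
for k in range(5):
    print(k, table_probability(k, 4 - k, 4 - k, k))
```

Out of 70 equally likely ways to pick four cups, the five outcomes take 1, 16, 36, 16, and 1 of them, so getting exactly three right is already sixteen times more likely under guessing than a perfect score.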

Anyway by this reasoning we can calculate the probability of the four cell table coming out of pure random chance, given the row counts and the column counts. So that's what we do here.

So this web app helps you to watch, notice, and classify your observations. For each observation, tap on one of the four cells and watch the count go up. TeaLady keeps track of the counts for you and calculates how probable counts like yours would be if the two groups produced your two labels at the same frequency. If that probability gets tiny, then you probably have a difference. Try it now with eight taps: Milk-first and Tea-first, and Guessed-milk-first and Guessed-tea-first. It'll tell you how likely your counts would be if there were really nothing going on.
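The tap-and-count loop can be sketched in Python as follows. This is a hypothetical stand-in, not TeaLady's actual code: the `Tally` class, the cell names A to D, and the choice of a one-sided test (counting tables with at least as many in cell A) are all assumptions made to match the 1/70 in the story. The app itself may compute a two-sided value.

```python
from math import comb


class Tally:
    """Hypothetical four-cell counter with a one-sided Fisher exact p-value."""

    CELLS = ("A", "B", "C", "D")

    def __init__(self):
        self.counts = {cell: 0 for cell in self.CELLS}

    def tap(self, cell):
        """Record one observation in the given cell."""
        self.counts[cell] += 1

    def p_value(self):
        """Chance, under the null, of a cell-A count at least this large,
        with row and column totals held at their observed values."""
        a, b, c, d = (self.counts[cell] for cell in self.CELLS)
        n = a + b + c + d
        row1, col1 = a + b, a + c
        largest_a = min(row1, col1)  # cell A cannot exceed either total
        ways = sum(
            comb(row1, x) * comb(n - row1, col1 - x)
            for x in range(a, largest_a + 1)
        )
        return ways / comb(n, col1)


# Eight taps: four milk-first cups called milk-first (A),
# four tea-first cups called tea-first (D).
tally = Tally()
for cell in "AAAADDDD":
    tally.tap(cell)
print(tally.p_value())
```

With all eight calls right, the value printed is 1/70 as a decimal; move one tap from A to B and one from D to C and it rises sharply, which is the behavior to look for while tapping in the app.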

A final point, and a footnote.