Tea Lady

A data collector app with Fisher's Exact Test

For a scientist in the field

Science can get complicated, but the base case is simply saying stuff about stuff: predication. Say, roses are red, for example, or quite generally Xs are Y: the predicate or property Y, like being red, applies to the category or group X, like roses. Predication is simple and general.
The structure and properties of predication, to at least a first approximation, apply to (1) human language sentence structure (syntax), (2) meaningful assertions about reality (semantics), and (3) interpersonally received messaging (pragmatics) with (1) Subject and Predicate as the basic syntactic structure in many languages, or with (2) the basic form of statements in logic as found in the predicate calculus, and with (3) the Topic and Comment structure of linguistic pragmatics.
Almost anything you might say could be like this. Any property, attribute, or assertion about anything or about any category could be seen as having this structure. A scientist wants to know if it's really true and also to be able to persuade others it's true.

Here we translate simple predication into the statistical proof language of scientific argument, by consistently observing the facts of reality, keeping track of the counts of what we see, and calculating whether the counts observed are even slightly compatible with the opposite of your assertion. If so, your assertion is weak, flimsy, or false; if not, you win, or at least your idea survives. When the opposite, or null, hypothesis is very, very improbable, then your assertion, or something like it, becomes highly likely and also shareable, or at least is a good basis for an argument. Simple.

What was hard for me to wrap my head around is that this isn't just counting out a dozen, say, red roses. Your counts need to show a difference, a real difference. Because maybe you wear rose-colored glasses and everything is red to you; no one should have to believe someone wearing rose-colored glasses. So what kind of difference, and what makes it a real difference?

In 1935 a curious and thoughtful Englishman named Fisher wrote about this in The Design of Experiments (in the chapter later excerpted as "Mathematics of a Lady Tasting Tea") to explain why, and how to calculate whether, some counts show a real difference.

For Fisher, what is a difference? It's not two numbers, it's four. A difference shows a real difference if it is two differences and THEY are different. Suppose you have ten rookies and ten veterans. Is that a difference? No, not yet! You may say they're different, but your words are meaningless unless you can (statistically) prove it. Maybe, for example, the veterans actually never learned anything because the coaching is useless. You have to show they are different by observing a difference. So you put them to a task, say, and mostly the veterans can do it, and mostly the rookies can't. That could be a difference. If they are all about the same, then they aren't actually different. So the setup has to be:

Rookies That Can, Rookies that Can't,
Veterans That Can, Veterans that Can't.
Four numbers. Like this.

               Rookies   Veterans   Side Total
Can               A         B         A+B
Can't             C         D         C+D
Bottom Total     A+C       B+D        The Total: A+B+C+D

Fisher gave us the math; thank you, Ronald! He uses those four numbers, A, B, C, and D, to calculate the probability that counts like these would come about just by random chance. That is, even if you assume there's no actual difference between the groups, the counts might still look like that by chance. Random means you use the row frequency to put things into rows, and the column frequency to put things into columns, and you don't say, oh, with the special combination of this row and this column we're going to throw in a special bump in the frequency because that column and that row is special. No, not that.

"No special bumps" is called the Null Hypothesis, and even though it's Null it still tells us enough that we can use it to actually calculate the probability of a random event like your actual counts based on the Null Hypothesis being true. Here the Null Hypothesis is that both rows have the same proportions and any difference is random. And calculating that random probability is called the Fisher Exact Test.
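To make "no special bumps" concrete, here is a minimal sketch in Python of the counts you would expect in each cell if only the row frequency and the column frequency mattered. The function name expected_counts is made up for this illustration; it is not part of Tea Lady.

```python
def expected_counts(a, b, c, d):
    """Expected cell counts under the Null Hypothesis (no special bumps).

    Each cell's expected count is (its row total * its column total) / grand total.
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    return [[r * col / n for col in col_totals] for r in row_totals]


# Ten rookies, ten veterans: 2 rookies can, 8 veterans can, 8 rookies can't, 2 veterans can't.
print(expected_counts(2, 8, 8, 2))  # [[5.0, 5.0], [5.0, 5.0]]
```

The observed 2 and 8 sit a long way from the expected 5 and 5; the Fisher Exact Test is what turns "a long way" into a probability.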

Fisher used it for the Lady Tasting Tea. Sit back, it's story time!

It was 1919 or perhaps 1922. He was being nice and made some tea for his co-worker's future fiancée, who was working a lot with algae, so maybe she was sensitized by all the slime around. She refused it. She said, "You put the milk into the tea; I prefer the opposite." He scoffed: of course there's no difference which order you pour in, hahaha. And her aspiring fiancé was there having tea too. He obviously wanted to be on her side, because later he did become her fiancé, and later still her husband, but he was probably also working with or for Fisher, so he had to take his side too. Caught in the middle, he said, "Let's test her." Apparently they did the experiment, and apparently she passed, amazing everyone. (Did you ever microwave tea with milk and notice a difference? Exactly.) Well, at least by 1935 when Fisher published it, this made a fine story about the advancement of science by having a tea party, and about men opening their minds enough to possibly listen to women, at least if she is your fiancée, and about Fisher, who turned the humiliation of his arrogance into a triumph for the world. Muriel Bristol, the magician of the story, was promptly forgotten, but for a type of green algae named for her, C. Muriella, but who knows that. Except you can read about it here and here.
There were two groups, tea first then milk, and milk first then tea. And two labels: the lady called it one way, and the lady called it the other way. The groups are about the tea itself, the labels are about the lady herself, thus: "Tea Lady". With 4 cups in each group, and all presented randomly on the tray, the probability she could have got them all right by random choice is: 4/8 that she got the first choice right (right?), times 3/7 that she got the second choice right, times 2/6 she got the third choice right, times 1/5 she got the last choice right. (4*3*2*1)/(8*7*6*5) = 24/1680 = 1/70.
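In terms of the four-cell table, when every call is right (so B = C = 0), that same product is:

   A         (A-1)         (A-2)               2             1
------- * ----------- * ----------- * ... * --------- * ---------
A+B+C+D   (A+B+C+D-1)   (A+B+C+D-2)         (B+C+D+2)   (B+C+D+1)
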
If Muriel was just guessing, still, the probability of getting all her guesses right in such an experiment is 1/70. Less than 5%, less than 2%, more than 1%. Pretty improbable. The fiancé later said she got enough right to prove her point. If it was 8 cups split 50/50, that means she got them all right, which we have just calculated to be a low-probability random event, which means she very likely wasn't guessing randomly.
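The 1/70 above can be checked with a few lines of Python. This is only a sketch of the arithmetic in the text, using exact fractions; the function name all_correct_probability is made up for the illustration.

```python
from fractions import Fraction
from math import comb


def all_correct_probability(cups_per_group):
    """Chance of picking every milk-first cup by pure guessing."""
    total = 2 * cups_per_group
    p = Fraction(1)
    for k in range(cups_per_group):
        # After k correct picks, (cups_per_group - k) right cups remain among (total - k).
        p *= Fraction(cups_per_group - k, total - k)
    return p


p = all_correct_probability(4)
print(p)  # 1/70
# Same answer as "one way out of all the ways to choose 4 cups from 8".
print(p == Fraction(1, comb(8, 4)))  # True
```

As a decimal, 1/70 is about 0.0143, which is where "less than 2%, more than 1%" comes from.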

The Fisher Exact Test is calculated pretty much like this; it's a little more complicated if she got some wrong, then those could have been in any order, so you have to count those orders in and divide them out again to get the answer. I won't explain how that goes but it's called Combinatorics and it's in the first part of probability class, so you can probably read about it and figure it out by yourself if you like.
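For the curious, here is a minimal sketch of that counting in Python, assuming the standard textbook form of the calculation (the hypergeometric probability of one table, with the row and column totals held fixed). The function name table_probability is made up here, and this is not necessarily how Tea Lady itself is coded.

```python
from fractions import Fraction
from math import comb


def table_probability(a, b, c, d):
    """Probability of exactly this four-cell table, given its row and column totals.

    Layout follows the table above: a and b are the top row, c and d the bottom row.
    """
    n = a + b + c + d
    # Ways to fill the first column (a from the top row, c from the bottom row),
    # divided by all the ways to choose that first column from everything.
    return Fraction(comb(a + b, a) * comb(c + d, c), comb(n, a + c))


print(table_probability(4, 0, 0, 4))  # 1/70, all eight cups right
print(table_probability(3, 1, 1, 3))  # 8/35, one cup of each kind wrong
```

With one mistake of each kind, the probability jumps from 1/70 to 16/70 (printed as 8/35), because there are 4 x 4 = 16 different ways to make that pair of mistakes.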

Anyway by this reasoning we can calculate the probability of the four cell table coming out of pure random chance, given the row counts and the column counts. So that's what we do here.

So this web app helps you to watch, notice, and classify your observations. For each observation, tap on one of the four cells and watch the count go up. TeaLady keeps track of the counts for you, and calculates the probability of getting your counts if the two groups really produce your two labels at the same frequency; if that probability gets tiny, then you probably have a difference. Try it now with eight taps, Milk-first and Tea-first, and Guessed-milk-first and Guessed-tea-first. It'll tell you how probable your counts would be if there's really nothing going on.
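The tap-and-count idea can be sketched in a few lines of Python. To be clear, this is not the app's actual code: the Tally class, its method names, and the cell names are all invented for illustration, and the probability is the standard fixed-totals calculation.

```python
from collections import Counter
from fractions import Fraction
from math import comb


class Tally:
    """Hypothetical tap counter; an illustration, not Tea Lady's implementation."""

    def __init__(self, groups, labels):
        self.groups = tuple(groups)  # the two column names
        self.labels = tuple(labels)  # the two row names
        self.counts = Counter()

    def tap(self, group, label):
        """Record one observation in the (group, label) cell."""
        if group not in self.groups or label not in self.labels:
            raise ValueError("unknown cell")
        self.counts[(group, label)] += 1

    def probability(self):
        """Chance of exactly these counts, given the totals, if the groups don't differ."""
        (g1, g2), (l1, l2) = self.groups, self.labels
        a, b = self.counts[(g1, l1)], self.counts[(g2, l1)]
        c, d = self.counts[(g1, l2)], self.counts[(g2, l2)]
        n = a + b + c + d
        return Fraction(comb(a + b, a) * comb(c + d, c), comb(n, a + c))


tally = Tally(groups=("milk-first", "tea-first"),
              labels=("guessed-milk-first", "guessed-tea-first"))
for _ in range(4):
    tally.tap("milk-first", "guessed-milk-first")
    tally.tap("tea-first", "guessed-tea-first")
print(tally.probability())  # 1/70
```

Eight taps, all of them correct guesses, reproduce the 1/70 from the story.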

I hope you will find many uses for this app. Like: driving down the road in Kenmore, we see men and women (seemingly), wearing facemasks for the coronavirus pandemic, and not wearing facemasks. Being curious whether this relates to the fact that men die from the virus more, we start tapping away, man with mask, man without mask, woman without mask, woman with mask. If it's really clear then we might not need too much data, but if the effect is small we might need a lot to tell the difference between frequencies of wearing masks. I can tap away making observations anywhere through my phone, and Tea Lady will tell me how probable it is to make those counts without any difference between men and women in their face-mask-wearing behavior. (I finally actually did this on June 23, 2020 in Seattle, and with the total number over 100, the result was P=0.021, and yes it was men wearing fewer masks.)

A final point, and a footnote.

The final point is, just don't stop here; there's a lot more to learn. Statistics and experiment design are worth study. Are your observations consistent and reliable? Is your data more complex than two labels on two groups? Did you define your experiment in advance so the chips could fall honestly? Tea Lady is a good hint, but it's just a hint. It's not a one-tailed test (which would include all the more extreme counts that there could have been, in this direction) nor a two-tailed test (including the more-extreme counts that run the other way too). Also, Tea Lady lets you cheat by collecting data until you get a result you like, which is a potentially dangerous idea that should encourage your skeptics. If I can assume that your cost of data collection isn't that high, then I wouldn't be too happy with you if you stopped at P<0.05. If your difference is actually real, and you can collect more data without that much more trouble, then I'd say the onus is on you to collect enough to get P<0.01, or, frankly, you would be hard to separate from the bullshitter who ran 20 experiments to get one reportable result. It's called P-hacking; Not Good. It's because in using Tea Lady here we are merging exploration with experiment. Getting the idea and proving the idea. If you already know a bit about your problem, like for example you have an estimate of the different frequencies for the two columns, then you can and should decide in advance how many observations your experiment will collect. With so many non-replicable results out there, let's please not contribute to that. One solution would be to do the experiment twice, once to estimate how much data is needed for a given level of significance, and another to actually test the question: if you get significance both times, then at least your second test wasn't P-hacking.
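To show what the one-tailed and two-tailed versions add to the single-table probability, here is a hedged sketch in Python. It assumes two things that are not from Tea Lady: "this direction" means a top-left count at least as large as the one observed, and the two-tailed sum uses one common convention, adding up every possible table (same totals) that is no more probable than the observed one. The function names are made up.

```python
from fractions import Fraction
from math import comb


def point_probability(a, b, c, d):
    """Probability of exactly this table, given its row and column totals."""
    n = a + b + c + d
    return Fraction(comb(a + b, a) * comb(c + d, c), comb(n, a + c))


def tail_probabilities(a, b, c, d):
    """Return (point, one_tailed, two_tailed) for the observed table."""
    row1, col1, n = a + b, a + c, a + b + c + d
    lowest = max(0, row1 + col1 - n)  # smallest possible top-left count
    highest = min(row1, col1)         # largest possible top-left count

    def prob(x):
        # Rebuild the whole table from its top-left count x and the fixed totals.
        return point_probability(x, row1 - x, col1 - x, n - row1 - col1 + x)

    observed = prob(a)
    # Assumed direction: top-left counts at least as large as observed.
    one_tailed = sum(prob(x) for x in range(a, highest + 1))
    # Assumed convention: every table no more probable than the observed one.
    two_tailed = sum(prob(x) for x in range(lowest, highest + 1) if prob(x) <= observed)
    return observed, one_tailed, two_tailed


print(tail_probabilities(4, 0, 0, 4))  # point 1/70, one-tailed 1/70, two-tailed 1/35
print(tail_probabilities(3, 1, 1, 3))  # point 8/35, one-tailed 17/70, two-tailed 17/35
```

When every cup is right there is nothing more extreme in that direction, so the point probability and the one-tailed probability agree; with one mistake of each kind, the tails make the result look even less impressive than the single-table number does.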

But wherever you are, perhaps sitting in a chair and watching the unfolding world, here with the Tea Lady is a great place to start.

Footnote: I'm a linguist, except for the fact that linguists are incapable of thinking about or being persuaded by anything that fails to use the structuralist method, which always seemed a curious limitation to me. So the question occurs to me: Is this the same as the structuralist method, but considered statistically? Both methods say that a difference, when it occurs differently in two contexts, is a real difference.

The structuralist method is to substitute members of a category into the same context, observe a resulting difference, and therefore infer a new category, perhaps a subcategory within that category. Let's take count and mass nouns. So for example "there is milk" is a normal sentence of English, so to speak, but "there is house" is not quite so. So here we have a context, "There is ___". And we have members of a category, nouns like house and milk. Some are fine in that context, others are not, and thus we can declare that nouns come in two subcategories, and if we are the first to notice this we get to give a name to the new categories and proclaim the advancement of science. Yay. In this case they are already known as count nouns, like houses, chairs, people, and spoons, and mass nouns, like space, furniture, milk, and rice. But generally, we are determining that there is something there, in the difference between two types of things. Structuralist method uses an explicit context of sameness, which may be present or absent in a Tea Lady description, and structuralist method also uses the behavior in that context to classify elements into the new subcategories, rather than pre-classifying the members of the subcategories just to discover if their behavior is even different in the first place. So these are analogous, similar, parallel, but not quite the same, as methods of science. Actually the difference is a bit subtle.

If the structuralist method required TWO contexts, it would be even more similar. For example consider "There IS __." and "There is MY ___." as contexts, and "milk" and "house" as things in the context. This makes a four-cell table just like the Fisher Exact Test, and if it is so entirely obvious to the linguist and to the linguist's readers that there is something unique in one cell that you don't even have to count it and run the statistic to prove the point, then the reasoning that follows in linguistics is identical to any other scientific reasoning based on Fisher and the Lady Tasting Tea.


Copyright © 2000-2020, Thomas C. Veatch. All rights reserved.
Modified: April 18, 2020