Tea Lady

A data collector app with Fisher's Exact Test

For a scientist in the field


Tea Lady is a web app that helps you to watch, notice, and classify your observations. For each observation, tap on one of the four cells and watch the count go up. Tea Lady keeps track of the counts for you, and calculates how probable counts like yours would be if the two groups produced your two labels at the same frequency. If that probability gets tiny, then you probably have a difference. Try it now with eight taps, Milk-first and Tea-first, and Guessed-milk-first and Guessed-tea-first. It'll tell you how likely counts like those would be if there's really nothing going on.

I hope you will find many uses for this app. Like: driving down the road in Kenmore, we see men and women (seemingly), wearing facemasks for the coronavirus pandemic, and not wearing facemasks. Being curious whether this relates to the fact that men die from the virus more, we start tapping away, man with mask, man without mask, woman without mask, woman with mask. If it's really clear then we might not need too much data, but if the effect is small we might need a lot to tell the difference between frequencies of wearing masks. I can tap away making observations wherever I am using my phone, and Tea Lady will tell me how probable it is to make those counts without any difference between men and women in their face-mask-wearing behavior. (I finally actually did this on June 23, 2020 in Seattle, and with the total number over 100, the result was P=0.021, and yes it was men wearing fewer masks.)

Simple Science

Science can get complicated, but the basics can be simple. The base case in science is simply saying stuff about stuff. We call it predication. Say, "Roses are red", for example, or quite generally Xs are Y: the predicate or property Y, like being red, applies to the category or group X, like roses. Predication is simple and general. Almost anything you might say is like this. Any attribute, property, or assertion about any category or group of things has exactly this structure: (1) the property applies to (2) the category and it's (3) true. And we don't just believe any old thing because someone said so even if it's you or me or the boss who said so. No. Reality has to say so.

So let's say we want to know if some category of things has some property in reality. Then we design an experiment that lets us observe and count cases so as to calculate how likely those counts would be if our assertion were false. That's right, we don't use the counts to pick true versus false directly. Instead we set up a Null Hypothesis against our assertion, which is such a clear case that it lets us actually do math to calculate the likelihood of the observed counts under the Null Hypothesis, and IF that likelihood is very low, THEN we conclude that the Null Hypothesis is likely to be wrong, and therefore that what's left over, our assertion, is likely to be true. It's backward reasoning, but it's tight. This is the language of statistical proof. The Null Hypothesis is the assertion that nothing is going on here, that your assertion fails to predict, that the association you would expect doesn't show up in the data. If the Null Hypothesis can be shown very unlikely, then you win, your assertion is left standing.

Make sense? Okay, let's do it, then. So here's the scientist, you, and you consistently observe some actual reality, observable over and over again on many different occasions, and you count up four kinds of cases: the cases that are in that category and not in that category, crossed with the cases which have that property and don't have that property. This makes a four-celled table like the one above.

What was hard for me to wrap my head around is that this isn't just counting out a dozen, say, red roses. Your counts need to show a difference, a real difference. Because maybe you wear rose-colored glasses and everything is red to you; no one should have to believe someone wearing rose-colored glasses. So what kind of difference, and what makes it a real difference?

In 1935 a curious and thoughtful Englishman named Fisher wrote about this in The Design of Experiments, in a chapter later reprinted as Mathematics of a Lady Tasting Tea, to explain why, and how to calculate whether, some counts show a real difference.

For Fisher, what is a difference? It's not two numbers, it's four. A difference shows a real difference if it is two differences and THE DIFFERENCES ARE DIFFERENT.

Suppose you have ten rookies, and ten veterans, is that a difference? No, not yet! You may say they're different, but your words are meaningless unless you can (statistically) prove it. Maybe, for example, the veterans actually never learned anything because the coaching is useless. You have to show they are different by observing a difference. So you put them to a task, say, and mostly the veterans can do it, and mostly the rookies can't. That could be a difference. If they are all about the same, then they aren't actually different. So the setup has to be

Rookies That Can, Rookies that Can't,
Veterans That Can, Veterans that Can't.
Four numbers. Like this.

              Rookies   Veterans   Side Total
Can           A         B          A+B
Can't         C         D          C+D
Bottom Total  A+C       B+D        The Total: A+B+C+D
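
As a minimal sketch of that four-number setup, here is one way to tally observations and read off the side and bottom totals in Python. The class name FourCellTable and its method names are made up for illustration; they are not taken from the Tea Lady app itself.

```python
from collections import Counter


class FourCellTable:
    """Tally observations into a 2x2 table (hypothetical helper, not the app's code)."""

    def __init__(self, groups, labels):
        self.groups = tuple(groups)   # the two columns, e.g. ("Rookies", "Veterans")
        self.labels = tuple(labels)   # the two rows, e.g. ("Can", "Can't")
        self.counts = Counter()

    def tap(self, group, label):
        """Record one observation, like tapping one of the four cells."""
        if group not in self.groups or label not in self.labels:
            raise ValueError("unknown group or label")
        self.counts[(group, label)] += 1

    def cells(self):
        """Return (A, B, C, D): first row left to right, then second row."""
        g1, g2 = self.groups
        l1, l2 = self.labels
        c = self.counts
        return c[(g1, l1)], c[(g2, l1)], c[(g1, l2)], c[(g2, l2)]

    def totals(self):
        """Side totals (rows), bottom totals (columns), and the grand total."""
        a, b, c, d = self.cells()
        return {"side": (a + b, c + d), "bottom": (a + c, b + d), "total": a + b + c + d}


table = FourCellTable(("Rookies", "Veterans"), ("Can", "Can't"))
for group, label in [("Rookies", "Can't"), ("Veterans", "Can"),
                     ("Veterans", "Can"), ("Rookies", "Can")]:
    table.tap(group, label)
print(table.cells())   # (1, 2, 1, 0)
```

Four taps are far too few to say anything, of course; the point is only that every observation lands in exactly one of the four cells.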

Fisher gave us the math; thank you, Ronald! He uses those four numbers, A, B, C, and D, to calculate the probability of counts like those arising just at random. Even if you assume that there's no actual difference between the groups, the counts might still look like that by random chance. Random means you use the row frequency to put things into rows, and the column frequency to put things into columns, and you don't say oh with the special combination of this row and this column we're going to throw in a special bump in the frequency because that column and that row is special. No, not that.
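
To make that concrete, here is a small sketch of what each cell would come out to on average if only the row frequency and the column frequency were at work. The function name expected_counts is made up for illustration, and this is the standard "row total times column total over grand total" rule, not something quoted from Fisher.

```python
from fractions import Fraction


def expected_counts(a, b, c, d):
    """Average cell counts when only row and column frequencies matter (no bumps)."""
    n = a + b + c + d
    if n == 0:
        raise ValueError("need at least one observation")
    rows = (a + b, c + d)   # side totals
    cols = (a + c, b + d)   # bottom totals
    # Each expected cell is (row total * column total) / grand total.
    return [[Fraction(r * k, n) for k in cols] for r in rows]
```

For example, with A=8, B=2, C=2, D=8 every side and bottom total is 10, so each cell would average 10*10/20 = 5. The observed 8s and 2s sit well away from that, and the question is how surprising that is.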

"No special bumps" is called the Null Hypothesis, and even though it's Null it still tells us enough that we can use it to actually calculate the probability of a random event like your actual counts based on the Null Hypothesis being true. Two rows would certainly be "the same" (proportionately) if C:D = A:B; two columns would also be "the same" if A:C = B:D. They might be a little different, but you might get three heads out of four coin tosses, too. Statistically the Null Hypothesis is that both rows come from the same underlying proportions, and any difference you see in the counts is just random. And calculating that random probability is called the Fisher Exact Test.

Fisher used it for the Lady Tasting Tea. Sit back, it's story time!

It was 1919 or perhaps 1922. He was nice; he made some tea for his co-worker's fiancée, who was working a lot with algae, so maybe she was sensitized by all the slime around, and she refused it. She said, "You put the milk into the tea; I prefer the opposite." He scoffed: of course there's no difference which order you pour in, hahaha. And her aspiring fiancé is there having tea too. He obviously wanted to be on her side, because later he did become her fiancé, and later still her husband, but he was probably also working with or for Fisher so he had to take Fisher's side too, so being caught in the middle, he said, "Let's test her." Apparently they did the experiment, and apparently she passed, amazing everyone. (Did you ever microwave tea with milk and notice a difference? Exactly.) Well, at least by 1935 when Fisher published it, this made a fine story about the advancement of science by having a tea party, and about men opening their minds enough to possibly listen to women, at least if she is your fiancée, and about Fisher, who turned the humiliation of his arrogance into a triumph for the world. Muriel Bristol, the magician of the story, was promptly forgotten, but for a type of green algae named for her, C. Muriella, but who knows that. Except you can read about it here and here.

There were two groups, tea first then milk, and milk first then tea. And two labels: the lady called it one way, and the lady called it the other way. The groups are about the tea itself, the labels are about the lady herself, thus: "Tea Lady". With 4 cups in each group, and all presented randomly on the tray, the probability she could have got them all right by random choice is: 4/8 that she got the first choice right (right?), times 3/7 that she got the second choice right, times 2/6 she got the third choice right, times 1/5 she got the last choice right. (4*3*2*1)/(8*7*6*5) = 24/1680 = 1/70.

   A         A-1         A-2               2         1
------- * --------- * --------- * ... * ------- * -------
B+C+D+A   B+C+D+A-1   B+C+D+A-2         B+C+D+2   B+C+D+1

If Muriel was just guessing, still, the probability of getting all her guesses right in such an experiment is 1/70. Less than 5%, less than 2%, more than 1%. Pretty improbable. The fiancé later said she got enough right to prove her point, which if it was 8 cups 50/50 means she got them all right, which we can calculate to be a low probability random event, which means she very likely wasn't guessing randomly.
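
Here is the same arithmetic as a short Python sketch, using exact fractions so nothing gets rounded. The function name all_right_by_chance is made up for illustration. It multiplies out the chain 4/8 * 3/7 * 2/6 * 1/5 and checks it against the shortcut of one right way out of all the ways to choose 4 cups from 8.

```python
from fractions import Fraction
from math import comb


def all_right_by_chance(cups_per_group):
    """Chance of picking out every cup of one group correctly by pure guessing."""
    total = 2 * cups_per_group
    p = Fraction(1)
    for k in range(cups_per_group):
        # After k correct picks, (cups_per_group - k) right cups remain among (total - k).
        p *= Fraction(cups_per_group - k, total - k)
    return p


print(all_right_by_chance(4))     # 1/70
print(Fraction(1, comb(8, 4)))    # 1/70, one right choice out of 70 possible choices
```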

The Fisher Exact Test is calculated pretty much like this. It's a little more complicated if she got some wrong, because then those could have been in any order, so you have to count those orders in and divide them out again to get the answer. I won't explain how that goes but it's called Combinatorics and it's in the first part of probability class, so you can probably read about it and figure it out by yourself if you like.

Anyway by this reasoning we can calculate the probability of the four cell table coming out of pure random chance, given the row counts and the column counts. So that's what we do here.
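
For the curious, here is a minimal sketch of that calculation in Python, using the textbook hypergeometric form of the Fisher Exact Test. The function names are made up for illustration, and this is not the app's actual source. Whether the app reports a one-sided or a two-sided value isn't stated here, so the sketch gives both: the one-sided value reproduces the 1/70 of the tea story, and the two-sided value also counts tables that are just as improbable in the opposite direction.

```python
from fractions import Fraction
from math import comb


def table_probability(a, b, c, d):
    """Chance of exactly this table, given its side and bottom totals."""
    n = a + b + c + d
    return Fraction(comb(a + b, a) * comb(c + d, c), comb(n, a + c))


def possible_tables(a, b, c, d):
    """Every table that shares the observed side and bottom totals."""
    row1, row2, col1 = a + b, c + d, a + c
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        yield x, row1 - x, col1 - x, row2 - (col1 - x)


def fisher_one_sided(a, b, c, d):
    """Add up tables whose top-left count is at least as large as observed."""
    return sum(table_probability(*t) for t in possible_tables(a, b, c, d) if t[0] >= a)


def fisher_two_sided(a, b, c, d):
    """Add up tables that are no more probable than the observed one."""
    observed = table_probability(a, b, c, d)
    return sum(p for p in (table_probability(*t) for t in possible_tables(a, b, c, d))
               if p <= observed)


print(fisher_one_sided(4, 0, 0, 4))   # 1/70, all eight cups called correctly
print(fisher_one_sided(3, 1, 1, 3))   # 17/70, one cup of each kind called wrong
```

With one mistake of each kind the one-sided value jumps to 17/70, about 24%, which is why eight cups only make the point if she gets them all right.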

Fear and Courage

A final point, and a footnote.

The final point is, just don't stop here; there's a lot more to learn. Statistics and experiment design are worth study. I encourage you! Are your observations consistent and reliable? Is your data more complex than two labels on two groups? Did you define your experiment in advance so the chips could fall honestly?

Tea Lady, a.k.a. the Fisher Exact Test, is now your super power.

You must decide whether you will use it for good or for evil.

I'm afraid Tea Lady allows you to cheat, in a way, by collecting data until you get a result you like, which is a potentially dangerous idea that should encourage your skeptics. Every new data point could be considered a separate experiment which will give you a new P value, and if after 20 of those experiments you got one with P<0.05, then that wouldn't be very meaningful, would it? You should find a consistent pattern, such as, more data shrinks P even more, or, if you do the whole thing a few times, it has a low P value each time. Indeed, if I can assume that your cost of data collection isn't that high, then I wouldn't be too happy with you if you stopped one exploratory experiment at P<0.05. If your difference is actually real, and you can collect more data without that much more trouble, then I'd say the onus is on you to collect enough to get P<0.01, or, frankly, you would be hard to separate from the bullshitter that ran 20 experiments to get one reportable result. That's called P-hacking; Not Good.
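
A rough back-of-the-envelope sketch shows why 20 tries at P<0.05 is not impressive. It assumes the experiments are independent and that, with nothing going on, each one dips below 0.05 with probability exactly 0.05. Both are simplifications: successive peeks at the same growing table are not independent, and the Fisher Exact Test tends to dip below 0.05 a bit less often than that. The function name is made up for illustration.

```python
def chance_of_false_alarm(experiments, alpha=0.05):
    """Chance that at least one of several independent null experiments gives P < alpha."""
    return 1 - (1 - alpha) ** experiments


print(round(chance_of_false_alarm(20), 2))   # 0.64
```

So under those assumptions, someone with nothing real to report gets at least one "significant" result out of 20 tries roughly two times out of three.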

Tea Lady here does merge exploration with experiment: Getting the idea and proving the idea. If you already know a bit about your problem, like for example you have an estimate of the different frequencies for the two columns, then you can and should decide in advance how many observations your experiment will collect. With so many non-replicable results out there, let's please not contribute to that. One solution would be to do the experiment twice, once to estimate how much data is needed for a given level of significance, and another to actually test the question: if you get significance both times, then at least your second test wasn't P-hacking.
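
One way to decide in advance is what statisticians call a power calculation. The sketch below is my own illustration of that standard idea, not something built into Tea Lady. It assumes you have guessed the frequency of the first label in each column (freq1 and freq2), that you collect the same number of observations in each column, and that you will use a two-sided Fisher Exact Test. All the function names and the 80% target are assumptions made up for the example.

```python
from math import comb


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher Exact Test P value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    probs = [comb(row1, x) * comb(row2, col1 - x) / denom
             for x in range(max(0, col1 - row2), min(row1, col1) + 1)]
    observed = comb(row1, a) * comb(row2, c) / denom
    # Add up tables no more probable than the observed one (tolerance for float ties).
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-9)))


def binomial(n, k, freq):
    """Chance of exactly k first-label observations out of n at the given frequency."""
    return comb(n, k) * freq ** k * (1 - freq) ** (n - k)


def chance_of_significance(freq1, freq2, per_group, alpha=0.05):
    """Chance of ending with P < alpha after per_group observations in each column."""
    total = 0.0
    for a in range(per_group + 1):        # first-label count in column 1
        for b in range(per_group + 1):    # first-label count in column 2
            if fisher_two_sided(a, b, per_group - a, per_group - b) < alpha:
                total += binomial(per_group, a, freq1) * binomial(per_group, b, freq2)
    return total


def observations_needed(freq1, freq2, target=0.8, alpha=0.05, limit=60):
    """Smallest per-column count (up to limit) that reaches the target chance, else None."""
    for per_group in range(2, limit + 1):
        if chance_of_significance(freq1, freq2, per_group, alpha) >= target:
            return per_group
    return None


print(observations_needed(0.2, 0.8))
```

Because the counts are whole numbers, the chance of significance does not rise perfectly smoothly with more data, so treat the number it prints as a rough guide and round up generously.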

But wherever you are, perhaps sitting in a chair and watching the unfolding world, here with the Tea Lady is a great place to start.


Structural Thinking

I'm a linguist, except for the fact that linguists are incapable of thinking about or being persuaded by anything that fails to use the structuralist method, which always seemed a curious limitation to me. So the question occurs to me, Is this the same as the structuralist method, but considered statistically? Both methods say that a difference, when it occurs differently in two contexts, is a real difference.

The structuralist method is to substitute members of a category into the same context, observing a resulting difference, and therefore to infer a new category, perhaps a subcategory within that category. Let's take count and mass nouns. So for example "there is milk" is a normal sentence of English, so to speak, but "there is house" is not quite so. So here we have a context, "There is ___". And we have members of a category, nouns like house and milk. Some are fine in that context, others are not, and thus we can declare that nouns are made in two subcategories, and if we are the first to notice this we get to give a name to the new categories and proclaim the advancement of science. Yay. In this case they are already known as count nouns like houses, chairs, people, and spoons, and mass nouns like space, furniture, milk, and rice. But generally, we are determining that there is something there, in the difference between two types of things.

Structuralist method uses an explicit context of sameness, which may be present or absent in a Tea Lady description, and structuralist method also uses the behavior in that context to classify elements into the new subcategories, rather than pre-classifying the members of the subcategories just to discover if their behavior is even different in the first place. So these are analogous, similar, parallel, but not quite the same, as methods of science. Actually the difference is a bit subtle.

If the structuralist method required TWO contexts, it would be even more similar. For example consider "There IS __." and "There is MY ___." as contexts, and "milk" and "house" as things in the context. This makes a four-cell table just like the Fisher Exact Test, and if it is so entirely obvious to the linguist and to the linguist's readers that there is something unique in one cell that you don't even have to count it and run the statistic to prove the point, then the reasoning that follows in linguistics is identical to any other scientific reasoning based on Fisher and the Lady Tasting Tea.



Predication is General because it applies to
  • (1) human language sentence structure (syntax),
  • (2) meaningful (i.e. true) assertions about reality (logical semantics), and
  • (3) interpersonally received messaging (pragmatics)
via, respectively,
  • (1) Subject and Predicate as the basic syntactic structure in many languages,
  • (2) the basic form of statements in logic as found in the predicate calculus, and
  • (3) the Topic and Comment structure of linguistic pragmatics.

Your thoughts?
                                          Feedback is welcome.
Copyright © 2000-2020, Thomas C. Veatch. All rights reserved.
Modified: April 18, 2020