Tea Lady

A data collector app with Fisher's Exact Test

For a scientist in the field: You



Introduction

Tea Lady helps you to classify and interpret observations. Each observation is going to be a yes or a no for each of two different questions; you can label the columns and rows yourself. Then for each one you actually observe, you tap one of the four cells. Watch the counts go up; Tea Lady keeps track for you and calculates how probable counts like yours would be if reality answered the two questions independently. If that probability gets tiny, then you probably have a difference (in one question) that makes a difference (for the other question).

Try it now with eight taps. Label the rows "Yes, Poured milk first" and "No, Poured tea first", and label the columns "Yes, Guessed milk first" and "No, Guessed tea first". Tea Lady will tell you the likelihood that there's really nothing going on. Punch in to get 4,0;0,4 and you will get P<0.01429: The Tea Lady was right!

I hope you will find many uses for this app. Like: driving down the road in Kenmore, we see men and women (seemingly), wearing facemasks for the coronavirus pandemic, and not wearing facemasks. Being curious whether this relates to the fact that men die from the virus more, we start tapping away, man with mask, man without mask, woman without mask, woman with mask. If it's really clear then we might not need too much data, but if the effect is small we might need a lot to tell the difference between frequencies of wearing masks. I can tap away making observations wherever I am using my phone, and Tea Lady will tell me how probable it is to make those counts without any difference between men and women in their face-mask-wearing behavior. (I finally actually did this experiment on June 23, 2020 in Seattle, and by the time I got a total number over 100, the result was P=0.021, and yes it was men wearing fewer masks.)

Simple Science

Science can get complicated, but the basics can be simple. The base case in science is simply saying stuff about stuff. We call it predication. Say, "Roses are red", for example, or quite generally Xs are Y: the predicate or property Y, like being red, applies to the category or group X, like roses. Predication is simple and general. Almost anything you might say is like this. Any attribute, property, or assertion about any category or group of things has exactly this structure: (1) the property applies to (2) the category and it's (3) true. And we don't just believe any old thing because someone said so even if it's you or me or the boss who said so. No. Reality has to say so.

So let's say we want to know if some category of things has some property in reality. Then we design an experiment that lets us observe and count cases so as to calculate the likelihood our assertion is false. How? We need four counts, in our four-celled table with columns labelled in-category and out-of-category, and rows labelled with-property and without-property. Then we can compare the differences. Two differences to see if the difference makes a difference. That's how Tea Lady thinks.

There might be no difference, if yes and no counts in both sets of rows are roughly similar. It could be 50:50 and 50:50, that would be no information. But it could also be 0:50 and 0:100, that's also no information because there's no difference in the first column to compare to the second. Or consider if it's 10:90 in one row, and 100:900 in the second, then still these ratios are the same, and there's no news there. The idea is that in these cases the rows and columns are independent. We call this the Null Hypothesis, meaning that the difference between the rows is not different in one column versus the other. The row difference makes no difference to the column difference. This Null Hypothesis actually lets us do math to calculate the likelihood of the observed counts. If the likelihood under the Null Hypothesis is very low, THEN we conclude that the Null Hypothesis is likely to be wrong, and therefore that what's left over, our original assertion, is likely to be true. It's backward reasoning, but it's tight. This is the language of statistical proof. The null hypothesis is the assertion that nothing is going on here, that your assertion fails to predict, that the association you would expect doesn't show up in the data. If the null hypothesis can be shown very unlikely, then you win, your assertion is left standing.
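Here is a minimal sketch in Python of what "the ratios are the same" means for the three no-news examples above. The function name rows_proportional is made up for this illustration, and exact equality is stricter than the "roughly similar" that real data would need; it only shows the idea.

```python
def rows_proportional(a, b, c, d):
    """True when row 1 (a:b) and row 2 (c:d) have exactly the same ratio.

    Cross-multiplying instead of dividing avoids trouble when a cell is zero.
    """
    return a * d == b * c


examples = [
    (50, 50, 50, 50),    # 50:50 and 50:50
    (0, 50, 0, 100),     # 0:50 and 0:100
    (10, 90, 100, 900),  # 10:90 and 100:900
    (4, 0, 0, 4),        # the eight-tap tea example, for contrast
]
for table in examples:
    print(table, rows_proportional(*table))
```

The first three tables come out proportional, so there is no news in them; the tea table does not.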

Make sense? Okay, let's do it, then. So here's the scientist, you, and you consistently observe some actual reality observable over and over again on many different occasions, and you count up four kinds of cases: The cases that are in that category and not in that category, and the cases which do and don't have that possibly-significant property. This makes a four-celled table like the one above.

What was hard for me to wrap my head around is that this isn't just counting out a dozen, say, red roses. Your counts need to show a difference, a real difference, a difference that makes a difference. Because maybe you wear rose-colored glasses and everything is red to you; no one should have to believe someone wearing rose-colored glasses when they say roses are red; there has to be stuff that's not red, for it to mean anything. So what kind of difference counts, and what makes it a real difference?

Soon it will be a nice time for a story: the Tea Lady story. Once upon a time, it was 1935, when a curious and thoughtful Englishman named Fisher wrote about this in The Design of Experiments (the passage was later reprinted as "Mathematics of a Lady Tasting Tea") to explain why, and how to calculate whether, some counts show a real difference.

For Fisher, what is a difference? It's not two numbers, it's four. A difference shows a real difference if it is two differences and THE DIFFERENCES ARE DIFFERENT.

Suppose you have ten rookies, and ten veterans, is that a difference? No, not yet! You may say they're different, but your words are meaningless unless you can (statistically) prove it. Maybe, for example, the veterans actually never learned anything because the coaching is useless. You have to show they are different by observing a difference. So you put them to a task, say, and maybe mostly the veterans can do it, and mostly the rookies can't. That could be a difference. If they are all about the same, then they aren't actually different. So the setup has to be

Rookies That Can, Rookies that Can't,
Veterans That Can, Veterans that Can't.
Four numbers. Like this.

               Rookies   Veterans   Side Total
Can            A         B          A+B
Can't          C         D          C+D
Bottom Total   A+C       B+D        The Total: A+B+C+D

Fisher gave us the math; thank you, Ronald! He uses those four numbers, A, B, C, and D, to calculate the probability it was just random. If you assume instead that there's no actual difference between the groups, the counts might still look like that by random chance. Random means you use the totals for the columns to make a global frequency or column probability. 110 vs 990, says mostly stuff is going to land in column 2: how often? 990/(110+990)= 990/1100 = 9/10 of the time. Same with the rows, use the totals for the rows to make a global frequency or row probability. This lets you calculate the probability of all four cells using the global frequencies, and that's under the Null Hypothesis.
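To make the "global frequency" idea concrete, here is a small Python sketch. It takes the four counts A, B, C, D laid out as in the table above and works out the counts you would expect in each cell if only the row and column totals mattered. The function name expected_counts is invented for this illustration; it is not something from the app.

```python
from fractions import Fraction


def expected_counts(a, b, c, d):
    """Expected cell counts under the Null Hypothesis, from the totals alone.

    Each cell gets (its row total) * (its column total) / (grand total).
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    return [[Fraction(r * k, n) for k in col_totals] for r in row_totals]


# The 10:90 and 100:900 example: column totals are 110 and 990.
print(Fraction(990, 110 + 990))            # share landing in column 2: 9/10
print(expected_counts(10, 90, 100, 900))   # same as the observed counts
print(expected_counts(4, 0, 0, 4))         # the tea table: 2 expected in every cell
```

When the expected counts match what you observed, as in the 10:90 and 100:900 case, there is no special bump anywhere. In the tea table every cell should hold 2 by chance, yet the observed cells hold 4 and 0.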

But sometimes, oh, with the special combination of this row and this column there's a special bump in the frequency because the combination of that column and that row doesn't fit very well; it's special. Oh No! not that!

"No special bumps" is called the Null Hypothesis, and even though it's Null it still tells us enough that we can use it to actually calculate the probability of a random event like your actual counts based on the Null Hypothesis being true. Two rows would certainly be "the same" (proportionately) if C:D = A:B; two columns would also be "the same" if A:C = B:D. They might be a little different, but you might get three heads out of four coin tosses, too. Statistically the Null Hypothesis is that the columns in both rows have close to the same proportions and any difference is small, potentially random. And calculating that random probability is called the Fisher Exact Test.

Fisher used it for the Lady Tasting Tea. Sit back, it's story time!

It was 1919 or perhaps 1922, he was nice, he made some tea for his co-worker's fiancée. She was working a lot with algae, so maybe she was sensitized by all the slime around; she refused it! She said, You put the milk into the tea, I prefer the opposite. He scoffed, of course there's no difference which order you pour in, hahaha.

But there is her aspiring fiancé, he's having tea too. Obviously he had to be on her side -- because later he did become her fiancé, and later still her husband -- but he was probably also working for Fisher so he had to take his boss's side too. So, being caught in the middle, he said, "Let's test her!"

Apparently they did the experiment, and apparently she passed, amazing everyone. (Did you ever microwave tea with milk and notice a difference? Exactly.) Well, at least by 1935 when Fisher published it, this made a fine story about the advancement of science by having a tea party, and about men opening their minds enough to possibly listen to women, at least if she is your fiancée, and about Fisher, who turned the humiliation of his arrogance into a triumph for the world. Muriel Bristol, the magician of the story, was promptly forgotten, but for a type of green algae named for her, C. muriella, but who knows that. Except you can read about it here and here.

What was the experiment? We need to know this, because we are doing something very similar with our predications about categories. Fisher made two groups, 'tea first then milk', and 'milk first then tea'. And two labels: 'the lady called it one way', and 'the lady called it the other way'.

The groups are about the tea itself, the labels are about the lady herself, thus: "Tea Lady".

With 4 cups in each group, and all presented randomly on the tray, the probability she could have got them all right by random choice is: 4/8 that she got the first choice right (right?), times 3/7 that she got the second choice right, times 2/6 she got the third choice right, times 1/5 she got the last choice right. (4*3*2*1)/(8*7*6*5) = 24/1680 = 1/70.
   A         A-1         A-2               2         1
------- * --------- * --------- * ... * ------- * -------
A+B+C+D   A+B+C+D-1   A+B+C+D-2         B+C+D+2   B+C+D+1
If Muriel Bristol was just guessing randomly with a 50:50 coin toss, still, the probability of getting all her guesses right in such an experiment is 1/70. Less than 5%, less than 2%, more than 1%. Pretty improbable. The fiancé later said she got enough right to prove her point, which if it was 8 cups 50/50 means she got them all right, which we can calculate to be a low probability random event, which means she very likely wasn't guessing randomly.
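If you'd like to check the 1/70 yourself, here is a short Python sketch that gets it two ways: multiplying the four fractions step by step, and counting the ways to pick 4 cups out of 8. The variable names are just for this illustration.

```python
from fractions import Fraction
from math import comb

# Step by step: 4/8 * 3/7 * 2/6 * 1/5
p_sequential = Fraction(1)
for k in range(4):
    p_sequential *= Fraction(4 - k, 8 - k)

# By counting: only 1 of the C(8,4) ways to pick 4 cups out of 8 is all correct.
p_counting = Fraction(1, comb(8, 4))

print(p_sequential, p_counting)   # both are 1/70
print(float(p_counting))          # a bit under 0.01429
```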

The Fisher Exact Test is calculated pretty much like this; it's a little more complicated if she got some wrong, then those could have been in any order, so you have to count those orders in and divide them out again to get the answer. I won't explain how that goes but it's called Combinatorics and it's in the first part of probability class, so you can probably read about it and figure it out by yourself if you like.

Anyway by this reasoning we can calculate the probability of the four cell table coming out of pure random chance, given the row counts and the column counts. So that's what we do here.
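For the curious, here is one way the whole calculation might be sketched in Python, using only the standard library. It is an illustration under assumptions, not the app's actual code: the function names are invented here, and I am assuming the usual conventions, where the one-sided P adds up all tables with a top-left count at least as large as yours, and the two-sided P adds up all tables no more probable than yours.

```python
from math import comb


def table_probability(a, b, c, d):
    """Probability of exactly this table, given its row and column totals."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


def fisher_exact(a, b, c, d):
    """Return (one_sided, two_sided) P values for the table a, b; c, d."""
    row1, col1, n = a + b, a + c, a + b + c + d
    lo = max(0, row1 + col1 - n)   # smallest possible top-left count
    hi = min(row1, col1)           # largest possible top-left count
    probs = {
        x: table_probability(x, row1 - x, col1 - x, n - row1 - col1 + x)
        for x in range(lo, hi + 1)
    }
    p_observed = probs[a]
    one_sided = sum(p for x, p in probs.items() if x >= a)
    # The small tolerance guards against floating-point ties.
    two_sided = sum(p for p in probs.values() if p <= p_observed * (1 + 1e-9))
    return one_sided, two_sided


print(fisher_exact(4, 0, 0, 4))   # one-sided is 1/70, two-sided is 2/70
print(fisher_exact(3, 1, 1, 3))   # one cup wrong each way: much less convincing
```

The one-sided value for 4,0;0,4 is the same 1/70 worked out above. With one mistake in each group (3,1;1,3) the one-sided value rises to 17/70, which is why the lady had to get them all right to prove her point with only eight cups.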

Fear and Courage

A final point, and a footnote.

The final point is, just don't stop here; there's a lot more to learn. Statistics and experiment design are worth study. I encourage you! Are your observations consistent and reliable? Is your data more complex than two labels on two groups? Did you define your experiment in advance so the chips could fall honestly?

Tea Lady, a.k.a. the Fisher Exact Test, is now your super power.

You must decide whether you will use it for good or for evil.

I'm afraid Tea Lady allows you to cheat, in a way, by continuing to collect data just until you get a result you like, which is a potentially dangerous idea that should encourage your skeptics. Every new data point could be considered a separate experiment which will give you a new P value, and if after 20 of those experiments you got one with P<0.05, then that wouldn't be very meaningful, would it? You should find a consistent pattern, such as, even more data shrinks P even more, or, if you do the whole thing a few times, it has a low P value each time. Indeed, if I can assume that your cost of data collection isn't that high, then I wouldn't be too happy with you if you stopped one exploratory experiment at P<0.05. If your difference is actually real, and you can collect more data without that much more trouble, then I'd say the onus is on you to collect enough to get P<0.01, or, frankly, you would be hard to separate from the bullshitter that ran 20 experiments to get one reportable result. That's called P-hacking; Not Good.
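You can watch this cheat happen in a simulation. The Python sketch below is only an illustration under assumptions I made up for it: every tap lands in one of the four cells with equal chance (so the Null Hypothesis is true by construction), the P value is the two-sided one, and the numbers of experiments and taps are arbitrary. It compares how often you would cry "significant!" if you peek after every tap against how often you would if you only look once at the end.

```python
import random
from math import comb


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the table a, b; c, d."""
    row1, col1, n = a + b, a + c, a + b + c + d
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    # Integer weights for every table with the same totals.
    weights = {x: comb(row1, x) * comb(n - row1, col1 - x)
               for x in range(lo, hi + 1)}
    extreme = sum(w for w in weights.values() if w <= weights[a])
    return extreme / comb(n, col1)


def simulate(n_experiments=300, n_taps=60, alpha=0.05, seed=1):
    """Return (rate with peeking after every tap, rate looking only at the end)."""
    rng = random.Random(seed)
    hits_peeking = 0
    hits_at_end = 0
    for _ in range(n_experiments):
        counts = [0, 0, 0, 0]
        crossed = False
        for _ in range(n_taps):
            counts[rng.randrange(4)] += 1   # nothing is going on: pure chance
            p = fisher_two_sided(*counts)
            if p < alpha:
                crossed = True              # a peeker would have stopped here
        hits_peeking += crossed
        hits_at_end += p < alpha
    return hits_peeking / n_experiments, hits_at_end / n_experiments


peeking_rate, end_rate = simulate()
print("false alarms when peeking after every tap:", peeking_rate)
print("false alarms when looking once at the end:", end_rate)
```

The peeking rate can never be smaller than the look-once rate, because the last look is one of the peeks, and in runs like this it usually comes out noticeably larger. Those extra false alarms are what your skeptics are worried about.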

Tea Lady here does merge exploration with experiment: getting the idea and proving the idea. If you already know a bit about your problem, like for example you have an estimate of the different frequencies for the two columns, then you can and should decide in advance how many observations your experiment will collect. With so many non-replicable results out there, let's please not contribute to that. One solution would be to do the experiment twice, once to estimate how much data is needed for a given level of significance, and another to actually test the question: if you get significance both times, then at least your second test wasn't P-hacking.

But wherever you are, perhaps sitting in a chair and watching the unfolding world, here with the Tea Lady is a great place to start.

Footnotes

Structural thinking.

I'm a linguist, except for the fact that linguists are incapable of thinking about or being persuaded by anything that fails to use the structuralist method, which always seemed a curious limitation to me. So the question occurs to me, Is this the same as the structuralist method, but considered statistically? Both methods say, a difference when it occurs differently in two contexts, is a real difference.

The structuralist method is to substitute members of a category into the same context, observing a resulting difference, and therefore to infer a new category, perhaps a subcategory within that category. Let's take count and mass nouns. So for example "there is milk" is a normal sentence of English, so to speak, but "there is house" is not quite so. So here we have a context, "There is ___". And we have members of a category, nouns like house and milk. Some are fine in that context, others are not, and thus we can declare that nouns are made in two subcategories, and if we are the first to notice this we get to give a name to the new categories and proclaim the advancement of science. Yay. In this case they are already known as count nouns like houses, chairs, people, and spoons, and mass nouns like space, furniture, milk, and rice. But generally, we are determining that there is something there, in the difference between two types of things.

Structuralist method uses an explicit context of sameness, which may be present or absent in a Tea Lady description, and structuralist method also uses the behavior in that context to classify elements into the new subcategories, rather than pre-classifying the members of the subcategories just to discover if their behavior is even different in the first place. So these are analogous, similar, parallel, but not quite the same, as methods of science. Actually the difference is a bit subtle.

If the structuralist method required TWO contexts, it would be even more similar. For example consider "There IS __." and "There is MY ___." as contexts, and "milk" and "house" as things in the context. This makes a four-cell table just like the Fisher Exact Test, and if it is so entirely obvious to the linguist and to the linguist's readers that there is something unique in one cell that you don't even have to count it and run the statistic to prove the point, then the reasoning that follows in linguistics is identical to any other scientific reasoning based on Fisher and the Lady Tasting Tea.

Tadaa!

Predication

Predication is General because it applies to
  • (1) human language sentence structure (syntax),
  • (2) meaningful (i.e. true) assertions about reality (logical semantics), and
  • (3) interpersonally received messaging (pragmatics)
via, respectively,
  • (1) Subject and Predicate as the basic syntactic structure in many languages,
  • (2) the basic form of statements in logic as found in the predicate calculus, and
  • (3) the Topic and Comment structure of linguistic pragmatics.

Copyright © 2000-2020, Thomas C. Veatch. All rights reserved.
Modified: April 18, 2020