Myers Briggs is Bad. Or is it?

I keep hearing that Myers Briggs is terrible science, on the strength of a pro-continuity, anti-discreteness argument, which I think is baloney. Of course the 4 MB dimensions are continuous; anyone who thinks they're not is probably taking middle-brow PR too seriously. Language uses discrete categories, and the counter-proposed Big5, rather like MB, also uses language (in fact much the same language) to characterize personality dimensions. So under Big5 people may be characterized as Extroverted or not, for example, although in the rubber-meets-road truth of both systems people sit somewhere on the Extroversion/Introversion continuum, that is, more or less. I find middle-brow PR to be a lot more categorical than continuous reality generally warrants. But it's just so nice to think simply about things that humans will certainly keep doing it anyway, applying linguistic, i.e. categorical, thinking to continua just in order to be able to think about them.

David Brooks goes quite a bit further, saying:

The Myers-Briggs test has no scientific validity. About half the people who take it twice end up in entirely different categories the second time around.

Let's evaluate this under this slightly less unrealistic assumption, namely that the MB dimensions are continua. Since we can always normalize data by taking z-scores (subtract sample mean, divide by sample standard deviation), let "MB-character" and "MB-score" each be 4D vectors, each taken from a 4D normal distribution with independent dimensions (covariance matrices are diagonal). Let an "MB test" be the 4-bit binary categorization of an MB score, with for each dimension 1 if score > 0, 0 otherwise.
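As a minimal sketch of those two definitions in Python (the function names z_scores and mb_test and the raw numbers are made up for illustration, not taken from any real MB instrument):

```python
import random
import statistics

def z_scores(values):
    """Standardize: subtract the sample mean, divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator)
    return [(v - mean) / sd for v in values]

def mb_test(score):
    """4-bit categorization of a 4D score: 1 if the component is > 0, else 0."""
    return tuple(1 if s > 0 else 0 for s in score)

# Hypothetical raw scores for five people on one dimension (made-up numbers).
raw = [12.0, 15.0, 9.0, 20.0, 14.0]
z = z_scores(raw)

# One simulated 4D score with independent standard-normal components.
rng = random.Random(0)
score = [rng.gauss(0.0, 1.0) for _ in range(4)]
category = mb_test(score)
```

After standardizing, the cut point for every dimension is simply 0, which is why the categorization rule needs no per-dimension threshold.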

To answer the hyperventilating, just do a thought experiment with the statistics. Suppose Brooks is right that "half the people taking it twice end up in entirely different categories the second time around". That doesn't actually sound that bad to me.

The point is that variation might or might not be partitioned into MB-characterological variation and MB-test variation, and if MB-characterological variation is non-zero, then it could be more or less than the population variation. Let's have a look.

Brooks seems to say the first is zero. If MB has no validity, then MB-based characterological variation should not exist, MB tests should measure nothing. Notice he didn't claim it has only a little validity, or some validity; he claimed it has no validity. To me that means that all the outcome variability should come from the population variability, because everyone's character insofar as it is characterized by a completely invalid system like MB, would be exactly the same as everyone else's, and therefore each test even from the same person amounts to a random sampling from the whole-population distribution. I mean, I take No for an answer, and "No" means "No"; which part of "No" did we not understand?

Let's (First Model) accept Brooks' statement as null hypothesis: the assertion that the MB test has no validity. Then it has zero validity in characterizing individuals. Then the individual can be eliminated from the model generating MB test results. Then every MB test outcome, including multiple test outcomes from a given individual, is statistically just another sample from the whole population, which is independent of the individual.

Score ~ N(0,1)^4.

(To wit, "The random variable Score is drawn from a 4 dimensional Normal distribution with each mean 0 and each variance 1 and with each dimension independent of each other.")

Under this First Model, how frequently should pairs of tests by the same person yield a different 4-bit category? Each of the 4 scores is positive or negative with equal probability, so any one dimension matches between the two tests half the time. But this means the same MB test result should occur for the same person not half the time, as would be the case with a single-bit categorization, but 1/2^4 = 1/16 = 0.0625 of the time, because there are four independent dimensions, each of which is separately likely to differ half the time. So a 4-way consistency of 50% seems like a spectacular result, to me. Anyhow First Model would certainly be for shit if, as Brooks suggests, it's supposed to produce a 50% result. Because First Model predicts a 94% result: after 4 tests done twice, about 94% (15/16) of everyone should have a different result in at least one of the 4 tests, and the proportion of people with the same result on their second round of 4 tests should be 100%-94%, i.e. about 6%. Must I argue further? It's a bad model of reality as given, and needs to be given a change: at least a degree of freedom to adjust itself.
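The First Model arithmetic can be checked with a quick simulation. This is a rough Python sketch under the assumptions already stated (independent standard-normal scores, cut at 0); the function name first_model_change_rate, the seed, and the sample size are arbitrary choices for illustration.

```python
import random

def mb_test(score):
    """4-bit categorization: 1 if the component is > 0, else 0."""
    return tuple(1 if s > 0 else 0 for s in score)

def first_model_change_rate(n_pairs, seed=0):
    """Fraction of test pairs whose 4-bit categories differ when every
    score is an independent draw from N(0,1)^4 (no individual effect)."""
    rng = random.Random(seed)
    changed = 0
    for _ in range(n_pairs):
        first = mb_test([rng.gauss(0.0, 1.0) for _ in range(4)])
        second = mb_test([rng.gauss(0.0, 1.0) for _ in range(4)])
        if first != second:
            changed += 1
    return changed / n_pairs

exact_same = 0.5 ** 4             # 1/16: all four bits agree by chance
exact_changed = 1 - exact_same    # 15/16: at least one bit differs
simulated_changed = first_model_change_rate(20000)
```

The simulated proportion should land close to 15/16, nowhere near the 50% Brooks reports.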

So now let's try to estimate how bad and how good this could really be, by separating variance into two components in a generative model (Second Model). Let the first, within-individual variation have some non-zero variance V, and the second, within-population variation have another certain (normalized) variance 1. I'm proposing for the sake of a thought experiment, an idealized and generative model (comprising reflective latent variables drawn from model distributions as though measured directly). Here, then, measuring an individual's "Character" simply means taking a random sample from the normalized zero-mean, unit-variance population distribution (4 times, since this is a 4D model):

Character ~ N(0,1)^4.

The value of the random variable Character, i.e., the simulated person's character, is then taken as the true mean of that individual's MB scores. To generate a particular score for an individual means again simply taking a random sample from a 4D normal distribution, but in this case pulling from a shifted normal distribution with mean Character and a fixed variance V across all individuals:

Score ~ N(Character,V)^4.

Next MB Test is derived from Score by the rule, +1 if Score > 0, -1 otherwise. Finally MB Change is a function of two independently sampled MB Test values, 0 if same, 1 if different. We will look at MB Change statistics under the interpretation that:

Change ~ Bernoulli(P).

After 1000 samples are chosen, if P were 50%, then about 500 Change=1 results would be found. But this will turn out to be a function of V, actually, when we follow the generating algorithm to randomly produce Change values.

To see this, we can run a bunch of Monte Carlo simulations to see what proportion P of MB Change values are 1, out of, say, 1000 randomly generated cases.

MB.change.R is the R source code to generate and calculate these values, and Figure 1 shows a graph of the randomly generated results. A total of 1000 MB Change values are generated for each of 1000 MB Characters (4 dimensions per MB Score and 2 MB Scores per MB Change yields 1000 * 1000 * 4 * 2 = 8 million randomly generated values behind each of the points on the graph), so it's quite smooth, as you would expect.
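For readers who want to try the generating algorithm themselves, here is a minimal Python sketch of the Second Model. It is not a port of MB.change.R: the function name change_rate, the seed, and the sample size are assumptions for illustration. It also draws just one pair of tests per simulated person rather than 1000 pairs for each of 1000 Characters, so its estimates are noisier.

```python
import math
import random

def mb_test(score):
    """MB Test: +1 if the component is > 0, -1 otherwise."""
    return tuple(1 if s > 0 else -1 for s in score)

def change_rate(v, n_people=20000, seed=0):
    """Estimate P, the fraction of simulated people whose two MB Tests
    differ, for a given within-individual variance v."""
    rng = random.Random(seed)
    sd = math.sqrt(v)  # random.gauss takes a standard deviation, not a variance
    changed = 0
    for _ in range(n_people):
        # Character ~ N(0,1)^4: the person's true mean on each dimension.
        character = [rng.gauss(0.0, 1.0) for _ in range(4)]
        # Two independent Scores ~ N(Character, V)^4 for the same person.
        first = mb_test([rng.gauss(c, sd) for c in character])
        second = mb_test([rng.gauss(c, sd) for c in character])
        changed += first != second  # MB Change: 1 if different, 0 if same
    return changed / n_people

for v in (0.05, 0.14, 0.5, 1.0, 4.0):
    print(f"V = {v:<4}  P(change) ~ {change_rate(v):.3f}")
```

Sweeping v over a finer grid and plotting change_rate(v) against v gives a curve of the same kind as the one described for Figure 1.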


Visibly, adjusting V until P approximates 50% yields P=500/1000 at about V=0.14. Note that 1/7 is about 0.1428. So it's a bit less than V=1: 6/7 less.
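The simulated value can also be cross-checked analytically. Under the Second Model, two scores from the same person on one dimension share the same Character, so they are jointly normal with covariance 1, variance 1+V, and hence correlation 1/(1+V). A standard result for two zero-mean jointly normal variables with correlation rho is that they have the same sign with probability 1/2 + arcsin(rho)/pi. The sketch below applies that formula to all four dimensions; the names p_change and solve_v are made up for illustration, and the bisection bounds are arbitrary.

```python
import math

def p_change(v):
    """Closed-form P(4-bit category differs between two tests) under the Second Model."""
    rho = 1.0 / (1.0 + v)  # correlation of the two scores on one dimension
    p_same_sign = 0.5 + math.asin(rho) / math.pi
    return 1.0 - p_same_sign ** 4  # all four dimensions must agree for "same"

def solve_v(target=0.5, lo=1e-6, hi=10.0, iters=60):
    """Bisection for the V at which p_change(V) equals target.
    Relies on p_change increasing with V over [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if p_change(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

v_half = solve_v()  # the V that gives a 50% change rate
```

Solving for a 50% change rate this way gives a V close to the 0.14 read off the graph. As V grows without bound, p_change approaches 15/16, which is the First Model's prediction.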

One way to think of this is to consider the ratio between these variances, the within-individual variance V and the within-population variance (normalized to 1), as describing their relative importance. Because the population variance is normalized to one, the ratio between them is V/1, which is equal to just V itself.

Now when V is larger than 1, the model is describing a scenario in which population variance is smaller than individual variance. If individuals were all quite the same but in this test they naturally had widely variable scores, then their average or true mean scores would be tightly clustered around the global mean, while V would be larger than 1. This would be consistent with Brooks' claim that

"The Myers-Briggs test has no scientific validity."

To repeat, if the individual variance V were larger than the population variance, then indeed we might correctly say MB captures only variation that does not derive from individual character, that is, nothing informative about individuals, when information about individuals is exactly what MB claims to provide. Brooks would be right.

However, instead of V being larger than 1, consistent with Brooks, V is smaller than 1. Much smaller than 1.

Another way to put it is, a personality test of four binary discrete values yielding changes only half the time would seem to remove about 6/7 of the variance from the data.

Smart reader: you noticed I said "variance" not "standard deviation"! Okay, it's not so very super tight in the SD domain, because the sd=sqrt(var) relationship blows up our satisfying 14% to about 37%. But still it remains far less than 100%, so it's quite a bit less than the population standard deviation (also normalized at 1.0 because sqrt(1)=1), and someone who claims MB is *complete* crap will have to explain why it has to cluster individual score deviations down to 37% of the population standard deviation in order to get only 50% changes in 2nd-time testing. I mean, it's not everything, but it's something. So the blowsy critics will have to backpedal, and they will say, it does get something.

But then the strong claim that it has no scientific validity, that it predicts nothing, is, to be polite, too strong. Or in other words, claims of bullshit are themselves, actually, bullshit.

Which is what I'm sayin'.

Copyright © 2024 Thomas C. Veatch. All rights reserved.
Created: May 19, 2024