I keep hearing that Myers Briggs is terrible science, on the strength
of a pro-continuity, anti-discreteness argument, which I think is
baloney. Of course the 4 MB dimensions are continuous; anyone who
thinks they're not is probably taking middle-brow PR too seriously.
Language uses discrete categories, and, rather like MB, the
counter-proposed Big5 also uses language (in fact much the same
language) to characterize personality dimensions. Thus according to
Big5 people may be characterized as Extroverted or not, for example,
although in the rubber-meets-road truth of both systems people are
somewhere on the Extroversion/Introversion continuum, that is, more or
less extroverted. I find middle-brow PR to be a lot more categorical
than continuous reality generally warrants. But it's just so nice to
think simply about things that humans will certainly keep doing it
anyway, applying linguistic, i.e. categorical, thinking to continua,
just in order to be able to think about them.
Score ~ N(0,1)^4.
(To wit, "The random variable Score is drawn from a 4-dimensional
Normal distribution with mean 0 and variance 1 in each dimension, the
dimensions being independent of one another.")
Character ~ N(0,1)^4.
The value of the random variable Character, i.e., the simulated
person's character, is then taken as the true mean of that
individual's MB scores. To generate a particular score for an
individual means again simply taking a random sample from a 4D normal
distribution, but in this case pulling from a shifted normal
distribution with mean Character and a fixed variance V across all
individuals:
Score ~ N(Character,V)^4.
Next, MB Test is derived from Score, dimension by dimension, by the
rule: +1 if Score > 0, -1 otherwise. Finally, MB Change is a function
of two independently sampled MB Test values: 0 if all four dimensions
agree, 1 if any of them differ. We will look at MB Change statistics
under the interpretation that:
Change ~ Bernoulli(P).
After 1000 samples are chosen, if P were 50%, then about 500 Change=1
results would be found. But P will turn out to be a function of V,
actually, when we follow the generating algorithm to randomly produce
Change values.
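
The generating algorithm just described can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the sw/MB.Change.R script: the function names mb_test, mb_change and change_rate are made up for illustration, a fresh Character is drawn for every simulated person rather than reusing a fixed set of Characters, and Change is taken to be 1 when any of the four dimensions flips between the two tests.

```python
import random


def mb_test(score):
    """Categorize a 4D score: +1 where the coordinate is > 0, -1 otherwise."""
    return tuple(1 if s > 0 else -1 for s in score)


def mb_change(v, rng):
    """Simulate one person tested twice; return 1 if the 4-bit type changed."""
    sd = v ** 0.5  # random.gauss takes a standard deviation, not a variance
    # Character ~ N(0,1)^4: the person's true mean on each dimension.
    character = [rng.gauss(0.0, 1.0) for _ in range(4)]
    # Two independent Scores ~ N(Character, V)^4, each turned into an MB Test.
    first = mb_test([rng.gauss(c, sd) for c in character])
    second = mb_test([rng.gauss(c, sd) for c in character])
    return 0 if first == second else 1


def change_rate(v, n=20000, seed=0):
    """Monte Carlo estimate of P = Pr(Change = 1) for within-person variance v."""
    rng = random.Random(seed)
    return sum(mb_change(v, rng) for _ in range(n)) / n


if __name__ == "__main__":
    for v in (0.05, 0.14, 0.5, 1.0, 10.0):
        print(f"V = {v:5.2f}   estimated P = {change_rate(v):.3f}")
```

Calling change_rate over a grid of V values should trace out P as a function of V; near V = 0.14 the estimate should land close to one half, give or take Monte Carlo noise.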

David Brooks goes quite a bit further, saying:

"The Myers-Briggs test has no scientific validity. About half the people who take it twice end up in entirely different categories the second time around."

Let's evaluate this under a slightly less unrealistic assumption, namely that the MB dimensions are continua. Since we can always normalize data by taking z-scores (subtract sample mean, divide by sample standard deviation), let "MB-character" and "MB-score" each be 4D vectors, each taken from a 4D normal distribution with independent dimensions (covariance matrices are diagonal). Let an "MB test" be the 4-bit binary categorization of an MB score: for each dimension, 1 if score > 0, 0 otherwise.

To answer the hyperventilating, just do a thought experiment with the statistics. Suppose Brooks is right that "half the people taking it twice end up in entirely different categories the second time around". That doesn't actually sound that bad to me.

The point is that variation might or might not be partitioned into MB-characterological variation and MB-test variation, and if MB-characterological variation is non-zero, then it could be more or less than the population variation. Let's have a look.

Brooks seems to say the first is zero. If MB has **no** validity,
then MB-based characterological variation should not exist, MB tests
should measure **no**thing. Notice he didn't claim it has only a
little validity, or some validity; he claimed it has no validity. To
me that means that all the outcome variability should come from the
population variability, because everyone's character insofar as it is
characterized by a completely invalid system like MB, would be exactly
the same as everyone else's, and therefore each test even from the
same person amounts to a random sampling from the whole-population
distribution. I mean, I take No for an answer, and "No" means "No";
which part of "No" did we not understand?

Let's (First Model) accept Brooks' statement as null hypothesis: the assertion that the MB test has no validity. Then it has zero validity in characterizing individuals. Then the individual can be eliminated from the model generating MB test results. Then every MB test outcome, including multiple test outcomes from a given individual, is statistically just another sample from the whole population, which is independent of the individual.

Under this First Model, how frequently should pairs of tests by the
same person yield a different 4-bit category? Each of the 4 scores is
positive or negative with equal probability, so on any one dimension
two tests agree half the time. But this actually means the same MB
test result should occur for the same person **not** half the time, as
would be the case with a **single** bit categorization, but 1/2^4 =
1/16 = 0.0625 of the time, because there are **four** independent
tests, each of which is separately likely to differ half the time. So
4-way consistency of 50% seems like a spectacular result, to me.
Anyhow First Model would certainly be for shit if, as Brooks suggests,
it's supposed to produce a 50% result. Because First Model predicts a
94% result: after 4 tests done twice, 94% of everyone should have a
different result in at least one of the 4 tests, and the proportion of
people with the same result on their second round of 4 tests should be
100%-94%, i.e. 6%. Must I argue further? It's a bad model of reality
as given, and needs a change: at least a degree of freedom with which
to adjust itself.
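
The 1/16 arithmetic can be checked by brute force. A minimal sketch, assuming (as First Model does) that each of the 16 four-bit types is equally likely on every sitting and that the two sittings are independent; the variable names are mine:

```python
from itertools import product

# The 16 possible 4-bit MB types, each coordinate being +1 or -1.
types = list(product((-1, 1), repeat=4))

# Every (first test, second test) pair; all 256 are equally likely under First Model.
pairs = [(a, b) for a in types for b in types]

p_same = sum(1 for a, b in pairs if a == b) / len(pairs)
p_change = 1 - p_same

print(p_same, p_change)  # 0.0625 0.9375
```

So under First Model roughly 94% of retests should change category, well above the "about half" that Brooks reports.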

So now let's try to estimate how bad and how good this could really be, by separating variance into two components in a generative model (Second Model). Let the first, within-individual variation have some non-zero variance V, and the second, within-population variation have another certain (normalized) variance 1. I'm proposing, for the sake of a thought experiment, an idealized and generative model (comprising reflective latent variables drawn from model distributions as though measured directly). Here, then, measuring an individual's "Character" simply means taking a random sample from the normalized zero-mean, unit-variance population distribution (4 times, since this is a 4D model), that is, Character ~ N(0,1)^4.

To see this, we can run a bunch of Monte Carlo simulations to see what proportion P of MB Change values are 1, out of, say, 1000 randomly generated cases.

sw/MB.Change.R is the R source code to generate and calculate these values, and Figure 1 shows a graph of the randomly generated results. A total of 1000 MB Change values are generated for each of 1000 MB Characters (4 dimensions per MB Score and 2 MB Scores per MB Change yields 1000 * 1000 * 4 * 2 = 8 million randomly generated values behind each of the points on the graph), so it's quite smooth, as you would expect.

[Figure 1: MB.Change.pdf]

Visibly, adjusting V until P approximates 50% yields P=500/1000 at about V=0.14. Note that 1/7 is about 0.1428, so V is a bit less than 1/7.
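
The simulated crossover can be cross-checked without random numbers. Under the Second Model, two scores from the same person are jointly normal with correlation 1/(1+V) in each dimension (shared Character variance 1, total variance 1+V), and for a zero-mean bivariate normal pair the probability of landing on the same side of zero is 1/2 + arcsin(rho)/pi, a standard result. The sketch below, with function names of my own choosing, raises that to the fourth power for four independent dimensions and bisects for the V at which P is one half. It should come out a little under 1/7.

```python
import math


def p_change(v):
    """Closed-form Pr(Change = 1) when within-person variance is v."""
    rho = 1.0 / (1.0 + v)                      # correlation of two retest scores
    p_same_dim = 0.5 + math.asin(rho) / math.pi  # same sign on one dimension
    return 1.0 - p_same_dim ** 4               # any of 4 independent dimensions flips


def solve_v(target=0.5, lo=1e-9, hi=10.0, iters=100):
    """Bisection for the v with p_change(v) == target (p_change increases with v)."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if p_change(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


v_half = solve_v()
print(f"V at P = 50%: {v_half:.4f}")
print(f"corresponding standard deviation: {math.sqrt(v_half):.4f}")
```

Bisection is enough here because P only ever goes up as V goes up, so there is exactly one crossing of 50%.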

One way to think of this is to consider the ratio between these variances, the within-individual variance V, and the within-population variance (normalized to 1) as describing their relative importance. Because the population variance is normalized to one, the ratio between them is V/1, which is equal to just V itself.

Now when V is larger than 1, the model is describing a scenario in
which population variance is **smaller** than individual variance.
If individuals were all quite the same but in this test they naturally
had widely variable scores, then their average or true mean scores would be
tightly clustered around the global mean, while V would be **larger**
than 1. This would be consistent with Brooks' claim that

"The Myers-Briggs test has no scientific validity."

To repeat, if the individual variance V were larger than the population variance, then indeed we might correctly say MB captures only non-individual-character-derived variation within the population, or, that is, nothing informative about individuals, which is what MB claims to provide. Brooks would be right.

However, instead of V being larger than 1, consistent with Brooks, V is smaller than 1. Much smaller than 1.

Another way to put it is, a personality test of four binary discrete values yielding changes only half the time would seem to remove about 6/7 of the variance from the data.

Smart reader: you noticed I said "variance" not "standard deviation"! Okay, it's not so very super tight in the SD domain, because the sd=sqrt(var) relationship blows up our satisfying 14% to about 37%. But still it remains far less than 100%, so it's quite a bit less than the population standard deviation (also normalized at 1.0 because sqrt(1)=1), and someone who claims MB is *complete* crap will have to explain why it has to cluster individual score deviations down to 37% of the population standard deviation in order to get only 50% changes in 2nd-time testing. I mean, it's not everything, but it's something. So the blowsy critics will have to backpedal, and they will say, it does get something.

But then the strong claim that it has no scientific validity, that it predicts nothing, is, to be polite, too strong. Or in other words, claims of bullshit are themselves, actually, bullshit.

Which is what I'm sayin'.