A replication of the experiment in Lisker's thesis was carried out; it makes two points. First, the phonetic difference between two linguistically different sounds can be rather small. Second, pronunciations of the ``same'' sounds by speakers of different dialects can differ phonetically and thus lead to different results. The experiment may also be taken as an example of the improvements in speech-processing technology since the spectrograph machine.
This replication of Lisker (1949) used myself as the speaker, and more modern methods than were available then. 114 tokens of the form [pæp] and 142 tokens of the form [pɛp] were spoken in isolation, and a good quality cassette tape-recording was made.4.5 Each syllable was spoken as a complete utterance with its own separate intonation contour; a pause separated each adjacent pair of tokens. I took care to produce cardinal [æ] and [ɛ] qualities. I transcribe the pep tokens as [pʰɛp] and the pap tokens as ranging from [pʰap] to [pʰæp], with most tokens as [pʰæp]. My intent was to produce the [æ] in pap not as many North Americans produce the vowel in TRAP -- namely as a vowel more-or-less front-raised away from [æ] -- but as a true [æ] sound. While I succeeded in producing auditorily low-front [æ] tokens, they vary somewhat in how low they are. As a result, the F1 measurements of the [æ] tokens are more widely dispersed than those of the [ɛ] tokens, which I produced more naturally.
The tape-recording was digitized at 12,000 samples/second, with a resolution of 15 bits. The resulting waveform was subjected to automatic segmentation of the onset and offset of the acoustic vowel in the following way. First, the RMS amplitude was computed 200 times per second using a window 20ms in duration. Then an amplitude threshold was set to an effective value, and the onset and offset of the acoustic vowel were located according to a ``sloppy threshold-crossing'' algorithm developed for the purpose.4.6 This method, with appropriate thresholds, generates segmentation points that define the acoustic vowel. It is quite accurate because of the abrupt discontinuity in the amplitude contour at the onset and offset of the acoustic vowel; an appropriate threshold can be determined by inspecting the contour at a few boundaries. After the algorithm was run, the results were examined for errors by hand; where the segmentation was inaccurate, a slightly different threshold was chosen to improve the overall accuracy. Finally, gross segmentation errors were eliminated by hand.
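The segmentation procedure above can be sketched as follows. The exact ``sloppy threshold-crossing'' algorithm is not specified, so this shows only the core idea -- an RMS contour at 200 frames/second over 20ms windows, followed by a plain threshold crossing; the function names are illustrative, and the tolerance for brief dips that ``sloppy'' presumably implies is omitted:

```python
import numpy as np

def rms_contour(x, fs, win=0.020, rate=200):
    """RMS amplitude, computed `rate` times per second over `win`-second
    windows (200 frames/second and 20 ms windows, as in the text)."""
    hop = int(fs / rate)                 # 12000/200 = 60 samples between frames
    n = int(win * fs)                    # 20 ms window = 240 samples at 12 kHz
    starts = range(0, len(x) - n + 1, hop)
    return np.array([np.sqrt(np.mean(x[s:s + n] ** 2)) for s in starts])

def segment_vowel(rms, thresh):
    """Locate vowel onset/offset as the first and last frames whose RMS
    exceeds the threshold.  (A ``sloppy'' variant would also tolerate
    brief dips below threshold; that refinement is omitted here.)"""
    above = np.where(rms > thresh)[0]
    if len(above) == 0:
        return None                      # no vowel found at this threshold
    return above[0], above[-1]           # frame indices of onset and offset
```

Because the amplitude contour rises abruptly at the vowel onset, the resulting segmentation points are insensitive to the exact threshold chosen, which is what makes inspection at a few boundaries sufficient.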
Next, the formant trajectories were computed using the program formant and the default parameters described in step 4 of the 12-step program, page . Then the trajectories of the first three formant frequencies were extracted, within the segmentation points delimiting the acoustic vowel. These formant trajectories were then time-normalized, so that time t=0 occurs halfway between the endpoints. Finally, all tokens of each class were plotted on a time-frequency display, as shown in Figure . Each dot on these charts represents a single estimated formant frequency at a single time-offset relative to the middle of a particular token. Thus a single vowel token of .10 seconds duration would have 3 formants * 10 frames of .01 seconds each = 30 measurements displayed on the chart, where the three formants for a given frame are displayed one above the other, and frames are displayed sequentially, .01 seconds apart on the time axis.
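The time-normalization step can be sketched as follows. The representation -- one (F1, F2, F3) triple per .01-second frame within the acoustic vowel -- follows the text, but the function name and data layout are illustrative assumptions, not the formant program's actual output format:

```python
def time_normalize(formant_frames, frame_step=0.01):
    """Given one (F1, F2, F3) triple per .01 s frame within the acoustic
    vowel, return (time, formant_number, Hz) points with t = 0 placed
    halfway between the segmentation endpoints, ready for plotting."""
    n = len(formant_frames)
    mid = (n - 1) / 2.0                      # middle frame index
    points = []
    for i, frame in enumerate(formant_frames):
        t = (i - mid) * frame_step           # time offset from token midpoint
        for k, hz in enumerate(frame, start=1):
            points.append((t, k, hz))        # k = 1, 2, 3 for F1, F2, F3
    return points
```

A .10-second token thus yields 3 * 10 = 30 plotted points, matching the count given above, and tokens of different durations are aligned at their midpoints rather than their onsets.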
The two charts represent the distribution of formant trajectories within the syllable for the two forms pap and pep. First, observe that the distributions are quite consistent across tokens, especially in F1 for pep. There is a clear trajectory which all tokens follow quite closely.
Second, the differences between the pap trajectories and the pep trajectories seem at first glance quite small. Qualitatively this is certainly true: both have a similar-shaped rise in F1 and fall in F2 across the syllable; they even share a similar dropoff in F3 frequency near the end of the vowel. These movements reflect a slightly ingliding pronunciation, noted above as [].
Third, an apparent difference between the two forms is that the pap tokens seem considerably longer than the pep tokens (mean durations are .17 seconds versus .14 seconds). However, this difference is mostly explained by other factors. The set of pap tokens was produced before the set of pep tokens; as time went by, the production of tokens sped up, from about .713 seconds per token for the pap's to about .600 seconds per token for the pep's. Thus 84% of the duration difference can be attributed to an increase in overall rate rather than a difference in intrinsic duration of the two vowels. That is, while a small part of the decrease in measured duration from /æ/ to /ɛ/ may indeed be due to intrinsic duration differences, most of it is due to the simple fact that I was talking faster for /ɛ/ than for /æ/. Once rate is normalized for, the remaining duration difference between the two classes may be unreliable.
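The rate-adjustment reasoning can be made concrete as follows. This is one plausible bookkeeping -- scale the pap vowel duration by the ratio of per-token utterance times -- and it yields a share in the same ballpark as, though not exactly equal to, the 84% figure, whose precise computation the text does not spell out:

```python
def rate_explained_share(dur_slow, dur_fast, utt_slow, utt_fast):
    """Fraction of a vowel-duration difference attributable to overall
    speaking rate: predict the slow-rate vowel's duration at the fast
    rate, and see how much of the observed difference that accounts for."""
    predicted = dur_slow * (utt_fast / utt_slow)   # pap duration at pep rate
    explained = dur_slow - predicted               # difference due to rate alone
    return explained / (dur_slow - dur_fast)

# pap: .17 s vowels at .713 s/token; pep: .14 s vowels at .600 s/token
share = rate_explained_share(0.17, 0.14, 0.713, 0.600)
```

Under this bookkeeping, roughly nine tenths of the .03-second difference is attributable to rate, leaving only a small residue for intrinsic vowel duration.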
Fourth, despite these similarities, the two classes are completely distinguishable, as shown in Figure . When plotted together on the same chart, the pap and pep F1 trajectories have virtually no overlap. This chart shows that a clear difference in vowel quality can be reflected in quite small (about 110 Hz) but genuine differences in formant frequency measurements.
Lisker's (1949) results differ from the current results. There, considerable overlap in F1-F2 space was found between measurements of pronunciations of these two words. Here, on the other hand, there is virtually no overlap in F1. One explanation for the differences could be the methods of measurement. Measurement by ruler and eye on a printed spectrogram includes an inherent randomness of ±25 to ±40Hz even for practiced phoneticians working on good spectrograms with distinct formants.4.7 The difference found in these charts is not much greater than this inherent error, so perhaps the hand-measurement methods used were responsible for the overlap Lisker found. On the other hand, an equally likely explanation is that Lisker's pronunciations of pap and pep at the time were simply more similar to each other than mine were. This explanation drives home once again the difference between phonetics and phonology: pap pronounced by Lisker is the same word as pap pronounced by me; they have the same phonemes in them. But phonetically, my pronunciations and his may be quite different. Lisker is a Philadelphian; his /æ/ phoneme may therefore be raised from cardinal [æ], while I was making a conscious attempt to produce cardinal [æ]. So just because they are the same (lexically) doesn't mean they are really the same (phonetically). The same word, with the same phonemes, in the same context, can systematically be pronounced in different ways.
To summarize this replication of Lisker (1949): while pap and pep in my pronunciations are almost identical in their formant trajectories (F2 and F3 are identical, and the F1 trajectories have the same shape and are quite close in frequency), the F1 trajectories nonetheless have almost no overlap at all, differing by about 100 Hz. Thus a rather small but consistent difference in just a single formant frequency can reflect a very clear difference in vowel quality.
The above experiments provide a valid instance of the inference argued for above: rather small differences are found to be extremely consistent, and to reflect genuine, audible differences in vowel quality. Lisker (1949) makes the same point: two classes of sounds which are audibly quite different may overlap in F1-F2 space, yet have different means. The bottom line, then, is that significant differences, even small ones, in measured formant frequencies across classes of vowels can be taken seriously as reflecting real differences in phonetic quality.
Formant frequencies, and the articulatory dimensions of mouth opening and tongue body frontness, are continuously variable, unlike the discrete (usually binary) dimensions of phonological structure. It is controversial whether the control humans exert over these continuous dimensions in speech production is part of their linguistic competence or not. The hypothesis that it is part of linguistic competence is implicit in the work of Labov, Yaeger and Steiner (1972) and also in the present work. Humans do demonstrate fine-grained, precise, conscious control over tongue movement in the production of sounds along continuous (or at least, fine-grained) dimensions of articulation and acoustics. The (non-universal) ability to whistle scales shows that such control is indeed a part of human sound production capabilities. Thus this kind of control is certainly part of human phonetic competence. The question of whether it is also part of human linguistic competence is explored in later chapters.