A 12-step AA program is described here, which I used to convert a tape-recording into a summary file of phonological and acoustic information characterizing all the vowels in a discourse. That summary file is suitable for input into a statistical software package for data display and analysis, which was used to generate the charts and statistics that form the essence of this book. I hope that this program will be useful to others, but also that they will not simply accept it unquestioningly.
The steps in this procedure require very different amounts of time. Some take only seconds (steps 8, 9, and 12), while others are quite slow. The central step of the procedure, locating nuclei and correcting erroneous formant-tracks (step 10), is the most labor-intensive part, despite the order-of-magnitude improvement in efficiency introduced here.
(1) Suitable sections of the tape-recording were located by finding the most animated, un-self-conscious speech, preferably narratives, for a total of 6 to 10 minutes of suitable speech (but 25 minutes were taken for the first speaker, Rosie S. from Chicago). This results in roughly 2000 vowels' worth of speech for each speaker (exact numbers in Table below). This unit of speech closely matches Bloch's (1948) definition of the ``idiolect'': the speech of one speaker speaking on a single topic for a short period of time.
(2) The desired section was digitized at a sample rate of 12,000 Hz, with a sample resolution of 15 bits.
(3) The resulting large waveform file was broken into smaller chunks separated by periods of silence. This unit of speech was defined by Bloch (1948) as the linguistic ``utterance''. Break points were typically located in the middle of silent periods, but occasionally, when the stretch between silences contained an unwieldy amount of speech, a break-point was inserted between intonational phrases even if they were not separated by definite periods of silence. For example, if there appeared to be phrase-final lengthening and a drop-off in pitch associated with the end of a sentence or other large phrase, then I sometimes inserted a break-point between that and the beginning of the immediately following sounds. Similarly, a break-point might be inserted at the end of a false start. Each such delimited segment will be called an ``utterance''.
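The silence-based chunking in this step can be sketched as follows. This is a minimal illustration in Python, not the program actually used; the frame length, silence threshold, and minimum silence duration are invented values, and break points are placed in the middle of each sufficiently long silent period, as described above:

```python
import numpy as np

def split_utterances(signal, rate, frame_ms=20, silence_db=-40, min_silence_ms=300):
    """Split a waveform into utterance chunks at sustained silent periods.

    Hypothetical sketch: thresholds (frame length, silence floor,
    minimum silence duration) are illustrative, not the values used
    in the original procedure.
    """
    frame = int(rate * frame_ms / 1000)
    n_frames = len(signal) // frame
    # Per-frame RMS energy, in dB relative to the loudest frame.
    rms = np.array([np.sqrt(np.mean(np.asarray(signal[i*frame:(i+1)*frame],
                                               dtype=float) ** 2))
                    for i in range(n_frames)])
    db = 20 * np.log10(np.maximum(rms, 1e-10) / (np.max(rms) + 1e-10))
    silent = db < silence_db
    min_frames = int(min_silence_ms / frame_ms)
    chunks, start, i = [], 0, 0
    while i < n_frames:
        if silent[i]:
            j = i
            while j < n_frames and silent[j]:
                j += 1
            if j - i >= min_frames:
                # Break in the middle of the silent period.
                mid = ((i + j) // 2) * frame
                if mid > start:
                    chunks.append(signal[start:mid])
                start = mid
            i = j
        else:
            i += 1
    chunks.append(signal[start:])
    return chunks
```

A real segmenter would also allow the manually inserted break-points between intonational phrases that the text describes; this sketch handles only the silence criterion.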
(4) Each utterance was formant-tracked.5.1 Formant-tracking subjects the signal to a number of operations. The entire waveform was downsampled to 10,000 samples/second and high-pass filtered to remove DC and low-frequency rumble. LPC poles were computed 100 times per second, using 12th-order autocorrelation LPC, with a pre-emphasis factor of 0.7 and an analysis window of 49 ms, weighted by a cosine^4 weighting function. While the window is relatively long, the weighting function reduces its effective width considerably, relative to a rectangular window. One Chicago speaker, Rita, was analysed with 10th-order autocorrelation LPC and a pre-emphasis factor of 0.85, in order to optimize the LPC analysis and formant-tracking parameters for her speech. The remainder were satisfactorily tracked using the default values above. Not all the LPC spectral peaks represent formants: some of them model the overall shape or tilt of the spectrum, while others are located on spectral peaks which may be identified as spurious or temporally discontinuous, or which have a wide bandwidth. A post-processing technique5.2 is therefore used to eliminate the spurious LPC peaks and to label a fixed number of formants (the default number is 4; 3 were tracked for Rita).
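The core of one LPC analysis frame can be sketched as follows. This is an illustrative Python reimplementation, not the original software: it applies pre-emphasis and a cosine^4 window, solves the autocorrelation normal equations, and reads candidate peak frequencies and bandwidths off the complex LPC poles. The post-processing that discards spurious peaks and labels a fixed number of formants is omitted:

```python
import numpy as np

def lpc_peaks(frame, rate, order=12, preemph=0.7):
    """Candidate spectral-peak frequencies (Hz) and bandwidths (Hz) for
    one analysis frame, via autocorrelation LPC. A sketch only; a real
    formant tracker adds bandwidth and continuity checks before
    labelling peaks as formants."""
    x = np.append(frame[0], frame[1:] - preemph * frame[:-1])  # pre-emphasis
    n = len(x)
    w = np.cos(np.pi * (np.arange(n) / (n - 1) - 0.5)) ** 4    # cos^4 window
    x = x * w
    # Autocorrelation method: solve R a = r for the predictor coefficients.
    full = np.correlate(x, x, 'full')
    r = full[n - 1:n + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    # Roots of A(z) = 1 - a1 z^-1 - ... - ap z^-p; keep upper-half-plane poles.
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]
    freqs = np.angle(roots) * rate / (2 * np.pi)
    bands = -np.log(np.abs(roots)) * rate / np.pi
    idx = np.argsort(freqs)
    return freqs[idx], bands[idx]
```

A pole lying close to the unit circle yields a narrow bandwidth and corresponds to a genuine resonance; poles well inside the circle, with wide bandwidths, are the ones that model the overall spectral shape.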
(5) The entire discourse was transcribed orthographically (except for Jamaican, which was transcribed directly in phonemic form).
(6) A dictionary of orthographic words and their phonological forms was created. Many or all instances of each word were listened to in determining the phonological form(s) of the word. The orthographic transcription was then converted into a phonological transcription according to the dictionary. This was done by a conceptually simple program, written for the purpose, which replaces each word's orthographic form with its phonological form and inserts word boundaries. Morphophonological alternations and external sandhi effects (such as // vs. /i:/, etc.) are accounted for by providing different phonological forms for certain words.
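The dictionary-replacement program can be sketched as follows. The phoneme symbols and the '#' word-boundary mark are invented for illustration (this is not the actual dictionary or program), and sandhi alternants would in practice be entered as separate dictionary forms selected during transcription:

```python
def phonologize(ortho, lexicon):
    """Replace each orthographic word by its phonological form and
    insert word boundaries (marked here with '#').

    Hypothetical sketch; the symbols are illustrative ASCII phoneme
    codes, not the transcription system actually used."""
    out = []
    for word in ortho.lower().split():
        if word not in lexicon:
            raise KeyError(f"word not in dictionary: {word}")
        out.append(lexicon[word])
    return '#' + '#'.join(out) + '#'

# Invented two-word dictionary, for illustration only.
lexicon = {"the": "D@", "cat": "kat"}
```

For example, `phonologize("The cat", lexicon)` yields `"#D@#kat#"`.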
(7) All the vowels were impressionistically coded for phrasal stress,
as described in Section , below.
(8) All the vowels were extracted from the transcript, along with their stress levels and their adjacent consonantal and vocalic contexts. This was done automatically by a program written for the purpose.
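The automatic extraction might look like the following sketch, which records each vowel together with its immediate left and right context (stress coding is omitted); the vowel symbols and the '#' boundary mark are illustrative, not the actual transcription system:

```python
# Illustrative vowel symbols; the real system distinguishes many more.
VOWELS = set("aeiou@")

def extract_vowels(phonemic):
    """From a phonemic transcription with '#' word boundaries, list each
    vowel with its left and right context (a consonant, a vowel, or a
    '#' boundary). Hypothetical sketch, not the original program."""
    tokens = []
    for i, seg in enumerate(phonemic):
        if seg in VOWELS:
            left = phonemic[i - 1] if i > 0 else '#'
            right = phonemic[i + 1] if i < len(phonemic) - 1 else '#'
            tokens.append((seg, left, right))
    return tokens
```

Applied to the transcription `"#D@#kat#"`, this returns the two vowels with their contexts: `[('@', 'D', '#'), ('a', 'k', 't')]`.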
(9) Each vowel was classified as occurring in a clitic or non-clitic word, according to the criteria given in Section below, in order to exclude tokens whose surface phonological forms may be indeterminate due to phonological processes of reduction.
(10) The temporal locations of the onset, nucleus, and offset of each measurable vowel were marked. This step is the most painstaking and time-consuming one, taking about an hour to do 150 vowels and a day to do 600 (at a sane pace, though greater rates are possible in spurts). Each utterance's waveform is displayed on a computer
graphics terminal; a broad-band spectrogram on an expanded time- and
frequency- scale is computed using special signal-processing hardware
which does the task in near real-time (that is, calculates a
spectrogram of one second of speech in about one second). The raw LPC
peaks as well as the automatically-classified formants are overlaid in
color on top of the gray-scale spectrogram. The user is prompted to
specify a temporal location for each of the segmentation marks, by
clicking a button on a mouse-controlled pointer on the screen. At
this point also, the formant tracks are examined to see if they
correspond to the auditory quality of the sound and to check that they
follow real resonances shown in the spectrogram. Because both the raw LPC peaks and the automatically-classified formant tracks are overlaid directly on the spectrogram, the correspondence of the automatically-measured formants to the true resonances evident in the spectrogram can be checked visually at a glance. Sometimes mistracking occurs: a formant measurement may appear at a point where there is no energy in the spectrogram (for example, between two resonances). When this happens, there is almost always some LPC peak at the correct frequency location that has been incorrectly ignored by the post-processing algorithm. For this situation, the software provides a facility for relabelling LPC peaks: one selects a formant with the mouse-controlled pointer and draws the pointer over the correct LPC peaks, which are then relabelled as formants. Thus formants can be re-associated with LPC poles that lie on visible spectral peaks in the spectrogram.
This procedure is used to correct all the erroneous formant tracks,
during a detailed, vowel-by-vowel examination. The onset and offset
of the acoustic vowel are located at acoustic discontinuities where
they occur (with a precision of around 3 milliseconds) (see the
definition of ``acoustic vowel'' in Chapter 4) and at points of
maximum spectral change where they do not. The nucleus of the vowel
is located according to the methods in Section below,
which discusses nucleus-picking. If there is no acoustic vowel corresponding to a particular vowel phoneme, it is not marked. If overlapping speech or other noise distorts the formant measurements at any point in the syllable, that vowel token is discarded from the corpus. The total numbers of unmarked, discarded, and measured tokens are given in Section below.
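When no sharp acoustic discontinuity is present, the onset or offset is placed at the point of maximum spectral change. That criterion can be sketched as follows; this is a hypothetical illustration with invented frame and window sizes, not the actual software, which located the points interactively on a spectrogram display:

```python
import numpy as np

def max_spectral_change(signal, rate, lo_ms, hi_ms, frame_ms=10):
    """Locate the point of maximum frame-to-frame spectral change within
    a search window [lo_ms, hi_ms), a stand-in for placing a vowel onset
    or offset when no sharp acoustic discontinuity is present.
    Sketch only; the frame size is illustrative."""
    step = int(rate * frame_ms / 1000)
    frames = range(int(lo_ms / frame_ms), int(hi_ms / frame_ms))
    # Log-magnitude spectrum of each Hann-windowed frame.
    spectra = [np.log1p(np.abs(np.fft.rfft(signal[i*step:(i+1)*step]
                                           * np.hanning(step))))
               for i in frames]
    # Squared spectral distance between successive frames.
    diffs = [np.sum((spectra[k + 1] - spectra[k]) ** 2)
             for k in range(len(spectra) - 1)]
    best = int(np.argmax(diffs))
    # Boundary time (ms): start of the first frame after the change.
    return (int(lo_ms / frame_ms) + best + 1) * frame_ms
```

On a signal that switches abruptly from one steady sound to another, the largest inter-frame spectral distance falls at the switch, so the returned boundary lands there.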
(11) The formant frequencies were extracted from the corrected formant-track files at the time-locations of the marked nuclei.
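This lookup is straightforward: with formant tracks computed 100 times per second, a marked nucleus time indexes directly into the track array. A sketch, using a hypothetical function and data layout rather than the original file format:

```python
def formants_at(tracks, frame_rate, t_sec):
    """Read off the formant frequencies at a marked nucleus time.

    `tracks` is a list with one row per analysis frame and one entry
    per formant; `frame_rate` is frames per second (100 in the procedure
    described here). Hypothetical sketch, not the original code."""
    idx = int(round(t_sec * frame_rate))
    idx = max(0, min(idx, len(tracks) - 1))  # clamp to the track's extent
    return tracks[idx]
```

For example, with tracks `[[500, 1500], [510, 1490], [520, 1480]]` at 100 frames/second, a nucleus marked at 0.02 s reads off the third frame, `[520, 1480]`.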
(12) All these pieces of information were collected together into a grand summary file, which contains the following 16 pieces of information in columns.
Each summary file contains one line for each token measured for the speaker in question (exact numbers are given in Table below). This summary file becomes the input to a data display, analysis, and programming package called S.5.3 The various charts, statistics, and other results given in later chapters are generated by operations on the data within that statistics package.
For example, an F1-F2 plot of all measurements for a speaker is
created by the command, plot(-spkr$f2,-spkr$f1). (Remember
that plotting -F2 vs. -F1 gives the traditional orientation of vowel
charts, with [i] at the upper left, [u] at the upper right, and
[] at the bottom in the middle.)
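Assembling the summary file itself amounts to writing one tab-separated line per measured token. A sketch, with an invented subset of the 16 columns standing in for the real field list:

```python
import csv

def write_summary(rows, path):
    """Write the per-token summary file: one tab-separated line per
    measured vowel token. The field names here are an illustrative
    subset; the actual file has 16 columns."""
    fields = ['word', 'vowel', 'stress', 'f1', 'f2']  # illustrative subset
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fields, delimiter='\t')
        writer.writeheader()
        for row in rows:
            writer.writerow(row)
```

A tab-separated layout of this kind is read directly by statistics packages such as S, one column per variable and one row per token.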