A Twelve-Step Program for Acoustic Analysis

The 12-step AA program described here is the one I used to convert a tape-recording into a summary file of phonological and acoustic information characterizing all the vowels in a discourse. The summary file is suitable for input into a statistical software package for data display and analysis, which in turn generates the charts and statistics that form the essence of this book. I hope that this program will be useful to others, but also that they will not simply accept it unquestioningly.

The steps in this procedure require very different amounts of time. Some take only seconds (steps 8, 9, 12), while the others are quite slow. The central step in the procedure, that of locating nuclei and correcting erroneous formant-tracks (step 10), is the most labor-intensive part, despite the improvements in efficiency, of an order of magnitude or so, introduced here.

(1) Suitable sections of the tape-recording were located, by finding the most animated, un-self-conscious speech, preferably narratives, for a total of 6 to 10 minutes of suitable speech (but 25 minutes were taken for the first speaker, Rosie S. from Chicago). This results in roughly 2000 vowels' worth of speech for each speaker (exact numbers in Table below). This unit of speech closely matches the definition of the ``idiolect'', by Bloch (1948): The speech of one speaker speaking on a single topic for a short period of time.

(2) The desired section was digitized at a sample rate of 12,000 Hz, with a sample resolution of 15 bits.

(3) The resulting large waveform file was broken into smaller chunks separated by periods of silence. This unit of speech was defined by Bloch (1948) as the linguistic ``utterance''. Break points were typically located in the middle of silent periods, but occasionally, when the stretch between silences contained an unwieldy amount of speech, a break-point was inserted between intonational phrases even if they were not separated by a definite period of silence. For example, if there appeared to be phrase-final lengthening and a drop-off in pitch associated with the end of a sentence or other large phrase, then I sometimes inserted a break-point between that and the beginning of the immediately following sounds. Similarly, a break-point might be inserted at the end of a false start. Each such delimited segment will be called an ``utterance''.
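The silence-based part of this chunking can be sketched roughly as follows. This is only an illustration in Python, not the program actually used: the function name, the amplitude threshold, and the minimum silence duration are all assumptions, and the manually placed break-points between intonational phrases are not modelled.

```python
from typing import List, Tuple


def split_at_silences(samples: List[float], rate: int,
                      threshold: float = 0.02,
                      min_silence_s: float = 0.2) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges, cut at the midpoint of each
    sufficiently long silent run. threshold and min_silence_s are
    hypothetical values, not taken from the text."""
    min_len = int(min_silence_s * rate)
    cuts = []
    run_start = None  # index where the current silent run began, if any
    for i, x in enumerate(samples):
        if abs(x) < threshold:
            if run_start is None:
                run_start = i
        else:
            # A silent run has just ended; cut in its middle if it was long
            # enough and did not start at the very beginning of the file.
            if run_start is not None and run_start > 0 and i - run_start >= min_len:
                cuts.append((run_start + i) // 2)
            run_start = None
    bounds = [0] + cuts + [len(samples)]
    return list(zip(bounds, bounds[1:]))
```

A per-sample amplitude test is the crudest possible silence detector; a frame-energy measure would be more robust on real recordings.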

(4) Each utterance was formant-tracked.^5.1 Formant-tracking subjected the signal to a number of operations. The entire waveform was downsampled to 10,000 samples/second and high-pass filtered to remove DC and low-frequency rumble. LPC poles were computed 100 times per second, using 12th order autocorrelation LPC, with a pre-emphasis factor of 0.7, and an analysis window of 49 ms, weighted by a cos^4 (cosine to the fourth power) weighting function. While the window is relatively long, the weighting function reduces its effective width considerably, relative to a rectangular window. One Chicago speaker, Rita, was analysed with order 10 autocorrelation LPC and a pre-emphasis factor of 0.85, to optimize the LPC analysis and formant-tracking parameters to her speech. The remainder were satisfactorily tracked using the above default values. Not all the LPC spectral peaks represent formants: some of them are used to model the overall shape or tilt of the spectrum, while others are located on spectral peaks which may be identified as spurious or temporally discontinuous, or which have a wide bandwidth. A post-processing technique^5.2 is therefore used to eliminate the spurious LPC peaks, and label a fixed number of formants (the default number is 4; 3 were tracked for Rita).
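The LPC stage of this step can be illustrated with a minimal pure-Python sketch. It is not the formant-tracking software actually used. The function names are hypothetical, and the exact form of the cos^4 window is an assumption (a Hanning-type taper raised to the fourth power). The defaults follow the values given above: order 12, pre-emphasis 0.7, a 49 ms window, and 100 frames per second at 10,000 samples/second.

```python
import math


def cos4_window(n):
    # ASSUMED form of the cos^4 weighting: Hanning-type taper to the 4th power.
    return [(0.5 * (1.0 - math.cos(2.0 * math.pi * (i + 0.5) / n))) ** 4
            for i in range(n)]


def autocorrelation(x, max_lag):
    return [sum(x[i] * x[i + lag] for i in range(len(x) - lag))
            for lag in range(max_lag + 1)]


def levinson(r, order):
    """Levinson-Durbin recursion. Returns ([1, a1, ..., ap], residual error)
    for the predictor polynomial A(z) = 1 + a1 z^-1 + ... + ap z^-p."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # signal already perfectly predicted; stop early
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err


def lpc_frame(frame, order=12, preemph=0.7):
    """Pre-emphasize, window, and fit autocorrelation LPC to one frame."""
    emphasized = [frame[0]] + [frame[i] - preemph * frame[i - 1]
                               for i in range(1, len(frame))]
    window = cos4_window(len(frame))
    windowed = [s * w for s, w in zip(emphasized, window)]
    r = autocorrelation(windowed, order)
    if r[0] == 0.0:  # pure silence: return the trivial predictor
        return [1.0] + [0.0] * order
    coeffs, _ = levinson(r, order)
    return coeffs


def lpc_track(signal, rate=10000, win_s=0.049, step_s=0.010, **kwargs):
    """One LPC coefficient vector per 10 ms step (100 frames per second)."""
    win = int(round(win_s * rate))
    step = int(round(step_s * rate))
    return [lpc_frame(signal[s:s + win], **kwargs)
            for s in range(0, len(signal) - win + 1, step)]
```

The LPC poles themselves are the roots of each predictor polynomial, which requires a numerical root-finder and is omitted here, as is the post-processing that labels poles as formants.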

(5) The entire discourse was transcribed orthographically (except for Jamaican, which was transcribed directly in phonemic form).

(6) A dictionary of orthographic words and their phonological forms was created. Many or all instances of each word were listened to, in determining the phonological form(s) of the word. The orthographic transcription was converted into a phonological transcription according to the dictionary. This is done by a conceptually simple program written for the purpose, which replaces each word's orthographic form by its phonological form and inserts word boundaries. Morphophonological alternations, phonological external sandhi effects (such as // vs. /i:/, etc.), are accounted for by providing different phonological forms for certain words.
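As an illustration of how conceptually simple such a conversion program can be, here is a minimal Python sketch. It is not the original program. The function name, the ``#'' word-boundary marker, the ASCII phonemic notation, and the convention of giving a variant form its own dictionary key (such as ``the2'') are all hypothetical.

```python
def to_phonological(orthographic, lexicon):
    """Replace each orthographic word by its phonological form and
    insert a word-boundary marker between words."""
    forms = []
    for word in orthographic.lower().split():
        key = word.strip(".,;:!?\"'")  # drop surrounding punctuation
        if not key:
            continue
        if key not in lexicon:
            raise KeyError("no dictionary entry for %r" % key)
        forms.append(lexicon[key])
    return " # ".join(forms)


# Hypothetical entries; a word with alternating forms gets one key per variant.
LEXICON = {"the": "D@", "the2": "Di:", "cat": "k{t", "apple": "{pl"}
```

With these entries, to_phonological("The cat.", LEXICON) gives "D@ # k{t", and a transcriber would write ``the2 apple'' to select the alternate form.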

(7) All the vowels were impressionistically coded for phrasal stress, as described in Section , below.

(8) All the vowels were extracted from the transcript, along with their stress levels and their adjacent consonantal and vocalic contexts. This was done automatically by a program written for the purpose.
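A sketch of this kind of extraction is given below, in Python. The data representation is an assumption made for illustration: the transcript is taken to be a list of (symbol, stress) pairs in which only vowels carry a stress level, the vowel inventory is a placeholder, and the four contexts are interpreted as the neighbouring vowels plus the strings of non-vowel symbols between them.

```python
VOWELS = {"i", "e", "a", "o", "u"}  # placeholder inventory, not the real one


def extract_vowels(segments):
    """segments: list of (symbol, stress); stress is None for non-vowels.
    Returns one dict per vowel with its stress and four context strings."""
    vowel_positions = [i for i, (sym, _) in enumerate(segments) if sym in VOWELS]
    tokens = []
    for n, i in enumerate(vowel_positions):
        prev_i = vowel_positions[n - 1] if n > 0 else None
        next_i = vowel_positions[n + 1] if n + 1 < len(vowel_positions) else None
        left = 0 if prev_i is None else prev_i + 1
        right = len(segments) if next_i is None else next_i
        tokens.append({
            "vowel": segments[i][0],
            "stress": segments[i][1],
            "pre_vowel": "" if prev_i is None else segments[prev_i][0],
            "pre_cons": "".join(sym for sym, _ in segments[left:i]),
            "post_cons": "".join(sym for sym, _ in segments[i + 1:right]),
            "post_vowel": "" if next_i is None else segments[next_i][0],
        })
    return tokens
```

Under this representation a word-boundary symbol such as ``#'' simply appears inside the consonantal context strings.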

(9) Each vowel was classified as occurring in a clitic or non-clitic word, according to the criteria given in Section below, in order to be able to exclude tokens whose surface phonological forms may be indeterminate due to phonological processes of reduction.

(10) The temporal locations of the onset, nucleus, and offset of each measurable vowel were marked. This step is the most painstaking and time-consuming one, taking about an hour to do 150 vowels, and a day to do 600 (at a sane pace, though greater rates are possible in spurts). Each utterance's waveform is displayed on a computer graphics terminal; a broad-band spectrogram on an expanded time- and frequency-scale is computed using special signal-processing hardware which does the task in near real-time (that is, it calculates a spectrogram of one second of speech in about one second). Both the raw LPC peaks and the automatically-classified formant tracks are overlaid in color directly on top of the gray-scale spectrogram. The user is prompted to specify a temporal location for each of the segmentation marks, by clicking a button on a mouse-controlled pointer on the screen. At this point also, the formant tracks are examined to see whether they correspond to the auditory quality of the sound and to check that they follow real resonances shown in the spectrogram. The overlay allows for immediate visual checking of the correspondence of the automatically-measured formants to true resonances that are evident on the spectrogram. Sometimes mistracking occurs: a formant measurement may occasionally appear at a point where there is no energy in the spectrogram (for example, between two resonances). When mistracking occurs, there is almost always some LPC peak at the correct frequency location which has been incorrectly ignored by the post-processing algorithm. For this situation, the software provides a facility for modifying the labelling of LPC peaks. One selects a formant with the mouse-controlled pointer and draws the pointer over the correct LPC peaks, and those peaks are relabelled as formants. Thus, formants can be re-associated with LPC poles that lie on visible spectral peaks in the spectrogram. This procedure is used to correct all the erroneous formant tracks, during a detailed, vowel-by-vowel examination. The onset and offset of the acoustic vowel (see the definition of ``acoustic vowel'' in Chapter 4) are located at acoustic discontinuities where they occur (with a precision of around 3 milliseconds) and at points of maximum spectral change where they do not. The nucleus of the vowel is located according to the methods in Section below, which discusses nucleus-picking. If there is no acoustic vowel corresponding to a particular vowel phoneme, it is not marked. If there is overlapping speech or other noise distorting the formant measurements at any point in the syllable, that vowel token is thrown out, that is, deleted from the corpus. The total numbers of unmarked, deleted, and measured tokens are given in Section below.

(11) The formant frequencies were extracted from the corrected formant-track files at the time-locations of the marked nuclei.
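Since the formant tracks are computed 100 times per second, looking up the formants at a nucleus amounts to converting a time into a frame index. The following Python sketch shows the idea. The function name and the in-memory track representation are hypothetical, and it assumes that frame i is centered at i/100 seconds, which the text does not state.

```python
FRAME_RATE = 100  # formant frames per second, as in step 4


def formants_at(track, time_s, n_formants=3):
    """track: list of per-frame formant tuples (F1, F2, ...), in Hz.
    Returns the first n_formants values of the frame nearest time_s,
    clamped to the ends of the track."""
    index = int(round(time_s * FRAME_RATE))
    index = min(max(index, 0), len(track) - 1)
    return tuple(track[index][:n_formants])
```

For example, with a track of at least 21 frames, formants_at(track, 0.2) reads F1, F2, and F3 from frame 20.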

(12) All these pieces of information were collected together into a grand summary file, which contains the following 16 pieces of information in columns:

a unique token ID number,
the vowel phoneme,
the preceding and following vocalic and consonantal contexts (4 character strings),
the impressionistically coded phrasal stress level,
the clitic vs. non-clitic classification,
the phonological form of the word in which the vowel occurs,
the file containing the waveform for that utterance,
the time of the onset, nucleus, and offset of the acoustic vowel,
F1, F2, and F3 at the marked nucleus time.
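The sixteen columns listed above can be pictured as one record per vowel token. The Python sketch below is illustrative only: the field names, the column order within each group, and the tab-separated layout are assumptions, not the actual file format.

```python
from dataclasses import dataclass, astuple


@dataclass
class VowelToken:
    token_id: int     # unique token ID number
    vowel: str        # the vowel phoneme
    pre_vowel: str    # four context strings
    pre_cons: str
    post_cons: str
    post_vowel: str
    stress: int       # impressionistically coded phrasal stress level
    clitic: bool      # clitic vs. non-clitic classification
    word_form: str    # phonological form of the containing word
    wave_file: str    # file containing the utterance waveform
    onset: float      # times of onset, nucleus, and offset (seconds)
    nucleus: float
    offset: float
    f1: float         # formants at the nucleus (Hz)
    f2: float
    f3: float


def to_line(token):
    """Render one token as a single tab-separated line of 16 columns."""
    return "\t".join(str(value) for value in astuple(token))
```

One line per measured token, written this way, gives a flat table that a statistics package can read directly.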

Each file contains a large number of lines (exact numbers given in Table below), one line for each token measured for the speaker in question. This summary file becomes the input to a data display, analysis, and programming package called S.^5.3 The various charts, statistics, and other results given in later chapters are generated by operations on the data within that statistics package. For example, an F1-F2 plot of all measurements for a speaker is created by the command plot(-spkr$f2,-spkr$f1). (Remember that plotting -F2 vs. -F1 gives the traditional orientation of vowel charts, with [i] at the upper left, [u] at the upper right, and [a] at the bottom in the middle.)

Thomas Veatch 2005-01-25