Acoustic segments which can easily be identified and delimited on spectrograms are analytically convenient objects, but their status in a theory of linguistic performance is far from clear. It is quite unlikely, for example, that the timing patterns of acoustically well-defined events constitute, in themselves, the underlying speech timing pattern. Phonetic timing data are observationally convenient, and form the raw materials for studies of linguistic timing. But analytically convenient fictions do not necessarily reflect the true categories of the phonological production mechanism, and many observed patterns will be artifacts. For example, the relation between underlying speech timing and acoustically well-defined moments such as burst onset or voicing onset is unclear, as is evident from continuing difficulties in establishing phonetic reality for ``syllable-timing'' and ``stress-timing'' patterns in various languages. Similarly, in a temporal theory of phonetic implementation, the timing of ``magic moments'' like burst onset, voicing onset, and other consistently measurable parts of the speech signal must be related to, or somehow derived from, intonational structure, inherent segmental timing patterns, and so on; but such sharply defined acoustic segments are not themselves the fundamental underlying units of production.
Despite these reservations, acoustic segments are the basic data for much speech research, and in particular for the present work. We must be clear about what we are talking about; therefore definitions are important.
It is also clear that acoustic segments are not unrelated to the underlying production mechanisms. Burst onset, for example, occurs at a particular point during the tongue- or lip-gesture that closes and opens the vocal tract at a particular place of articulation. The burst is distinct from the gesture, but it may be used to locate a part of the gesture in time, and the spectral and temporal information in the burst carries perceptual cues as to the underlying gesture that occurred. Further, there are exceptions: those gestures which typically have particular acoustic events associated with them don't always co-occur with these events. For example, expiratory force may be so weak in some performances that the burst is unmeasurable. Thus, typical patterns may not accurately describe all instances. Nonetheless, we proceed by examining and characterizing what we think are typical instances, pointing out along the way those cases that are excluded or that do not fit the typical patterns. The existence of exceptions does not prevent us from characterizing the typical patterns, and from basing a partial understanding of the underlying system on these patterns. This is the strategy in this work, which is primarily a study of acoustic vowels.
The reason acoustic vowels are interesting is that we understand fairly well how they are produced, and we can infer from their acoustic structure what the mouth is doing, in some detail (cf. Chapter 2, Acoustics). On the other hand, during noise-excited segments, the formant structure is typically less clear (F1 during [s], for example, is hard to locate precisely by any objective means, as a spectrogram will make clear), and thus articulatory movements and audible vowel color are less clear.
Therefore acoustic vowels are a very useful object of description. However, as follows from these definitions, the realization of an underlying vowel is not composed entirely of the corresponding acoustic vowel; nor is the realization of an underlying consonant entirely constituted by a corresponding acoustic consonant. Coarticulation occurs very widely in the realization of underlying sounds. Consequently, acoustic vowels frequently contain some aspects of the realization of underlying consonants, just as acoustic consonants often contain aspects of the realization of underlying vowels.
Acoustic vowels are bounded by the beginning and ending of voicing, or by the onset or offset of frication or oral closure. An acoustic vowel between stops typically begins at the point of post-release voice onset and ends at the point of devoicing or oral closure. Not all vowels are associated with acoustic vowels. Whispered vowels and [h] are not acoustic vowels, nor are devoiced vowels, which most frequently occur (in English as well as in Japanese) between voiceless obstruents, when the vowel is high, unstressed, and front. Such vowels are indeed (underlyingly) vowels, but their formants are difficult to measure, and the movement of the articulators is harder to follow by formant tracking. Because of these difficulties, such vowels are excluded from acoustical study here.
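The boundary definition above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the signal has already been reduced to a sequence of frames, each labeled for voicing, frication, and oral closure; the `Frame` type, the function name `acoustic_vowel_spans`, and the 5 ms frame step are hypothetical and are not part of the procedure used in this work. Under those assumptions, an acoustic vowel is a maximal run of frames that are voiced and show neither frication nor closure, so whispered or devoiced stretches are excluded automatically.

```python
from typing import List, NamedTuple, Tuple


class Frame(NamedTuple):
    """Hypothetical per-frame labels, assumed to come from some prior analysis."""
    voiced: bool      # periodic (glottal) excitation present
    frication: bool   # noise excitation present
    closure: bool     # oral closure (e.g. stop gap)


def acoustic_vowel_spans(frames: List[Frame], step_ms: int = 5) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) for each maximal run of vowel-like frames.

    A frame counts as vowel-like when it is voiced and has neither frication
    nor oral closure. The end time is exclusive.
    """
    spans: List[Tuple[int, int]] = []
    start = None  # index of the first frame of the current run, if any
    for i, f in enumerate(frames):
        vowel_like = f.voiced and not f.frication and not f.closure
        if vowel_like and start is None:
            start = i                                   # voicing onset after release
        elif not vowel_like and start is not None:
            spans.append((start * step_ms, i * step_ms))  # devoicing, frication, or closure
            start = None
    if start is not None:                               # run extends to the end of the signal
        spans.append((start * step_ms, len(frames) * step_ms))
    return spans


# A stop-vowel-stop token: closure, voiceless release, vowel, closure.
token = (
    [Frame(False, False, True)] * 2     # stop closure
    + [Frame(False, True, False)]       # burst and aspiration, voiceless
    + [Frame(True, False, False)] * 4   # voiced, open vocal tract
    + [Frame(False, False, True)] * 2   # following closure
)
print(acoustic_vowel_spans(token))
```

In this toy token the acoustic vowel begins at post-release voice onset, not at the burst, which is the sense in which part of the realization of the underlying vowel falls outside the acoustic vowel.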
Semivowels are a problem for acoustic segmentation. They are often difficult to delimit in time. Where there are sharp acoustic discontinuities at the boundaries of the semivowel, as with the release of apical [l], those discontinuities define the border between the acoustic semivowel and an adjacent segment such as a vowel. Dark [ɫ] without apical contact, and thus without clear acoustic boundaries, is an acoustic vowel, while /l/ with apical contact, and sharp acoustic discontinuities at the point of closure, is not. Where there is a definite non-syllabic steady-state segment (as in /w/, where the /w/ may be held in a steady state for some time), the onset of change out of the steady state into the vowel is the location of the beginning of the acoustic vowel. Postvocalic English /r/ is an acoustic vowel.
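The steady-state criterion for segments like /w/ can be sketched in the same spirit. The sketch assumes a frame-by-frame F2 track in Hz is available; the function name `steady_state_exit` and both thresholds (`max_step_hz`, `min_steady_frames`) are hypothetical values chosen only for illustration, not measurements or settings from this study. It returns the index of the last steady frame, taken here as the onset of change into the vowel.

```python
from typing import Optional, Sequence


def steady_state_exit(
    f2_hz: Sequence[float],
    max_step_hz: float = 30.0,     # assumed: largest frame-to-frame change still "steady"
    min_steady_frames: int = 3,    # assumed: steps needed to count as a steady state
) -> Optional[int]:
    """Return the index where the track first leaves a steady state, or None.

    A steady state is a run of at least `min_steady_frames` consecutive
    frame-to-frame steps no larger than `max_step_hz`. The returned index is
    the last frame of that run, i.e. the point at which movement begins.
    """
    steady = 0  # number of consecutive small steps seen so far
    for i in range(1, len(f2_hz)):
        if abs(f2_hz[i] - f2_hz[i - 1]) <= max_step_hz:
            steady += 1
        elif steady >= min_steady_frames:
            return i - 1           # change begins at the last steady frame
        else:
            steady = 0             # too short to be a steady state; start over
    return None


# A held /w/-like low F2 followed by a rise into the vowel.
print(steady_state_exit([700, 705, 710, 708, 800, 950, 1100]))
# A track that moves continuously has no steady state to exit.
print(steady_state_exit([700, 900, 1100]))
```

When no steady state is found the function returns `None`, which corresponds to the case described above in which the semivowel offers no clear boundary and is difficult to delimit in time.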
It may seem plausible to distinguish between linguistic allophony, in which formal phonological rules of arbitrary expressive power modify abstract phonological segments (symbolic feature structures), and phonetic coarticulation, in which phonetic segments (articulatory gestures) influence each other by virtue of the physical limitations (e.g., sluggishness) of the physical speech production system. However, a middle view is in fact more sensible, in which phonological rules require limitations on expressive power that reflect physical limitations of the speech production device, and in which the ``physical limitations'' of the device may vary from one language or dialect to another. In such a view, a coarticulated phonetic segment is just an assimilated allophone. ``Coarticulation'' and ``segmentally conditioned allophony'' are in this view identical. Perhaps the most important distinction between the terms is not in the phenomena referred to, but in that ``coarticulation'' is more common among phoneticians, while ``allophone'' is more common among phonologists.
Both the adjacent underlying consonants and the underlying vowel simultaneously determine the changing formant structure at the edges (as well as in the middle) of acoustic vowels. These transitions are part of the realization of the underlying consonant as well as of the underlying vowel, as recent perceptual studies have amply demonstrated (Strange et al. 1976, and references in Strange 1989). A transition is an important cue to both the underlying consonant and vowel of which it is a joint realization.
Defining ``acoustic consonant'' and ``acoustic vowel'' allows us to say that coarticulatory effects on an acoustic vowel are effects of an underlying (articulatory or phonological) consonant, without implying that the ``actual'' consonant and vowel are realized elsewhere. An acoustic vowel segment is not to be taken as corresponding exclusively to an underlying vowel; it may, and here does, correspond to (i.e., provide perceptual cues for) underlying consonants. Thus coarticulatory effects observable in acoustic vowels are not incidental effects of a consonant on a vowel, but are part of the phonetic realization of the underlying consonant itself.