
Speech Perception

The cognitive processes by which listeners extract linguistic information from the continuous, variable, and noisy acoustic signal of spoken language.

Percept = f(acoustic signal, phonological knowledge, lexical context)

Speech perception is a remarkable feat of human cognition. Listeners decode a continuous stream of sound into discrete words and meanings with little apparent effort, despite enormous variability in the acoustic signal caused by differences in speakers, speaking rates, accents, and background noise. The problem is computationally hard: automatic speech recognition systems, despite dramatic improvements, can still struggle in conditions that human listeners handle easily.

The Acoustic Signal

Speech sounds are produced by coordinated movements of the lungs, vocal folds, tongue, lips, and jaw. These articulatory gestures create complex acoustic patterns characterized by formant frequencies (resonances of the vocal tract), voice onset time (the delay between release of a consonant closure and the onset of voicing), and spectral transitions. The speech signal is continuous — there are no reliable acoustic boundaries between words, and often not between phonemes.

Categorical Perception

One of the earliest and most influential findings in speech perception research was categorical perception. When a physical continuum (such as voice onset time, which distinguishes /b/ from /p/) is varied in equal steps, listeners do not report a gradual change but instead perceive a sharp boundary between categories. Discrimination is far better for a pair of sounds that straddles the category boundary than for a pair within one category, even when the physical difference between the two sounds is the same. This suggests that the speech perception system imposes discrete categories on continuous acoustic input.

Voice Onset Time (VOT)
/b/ → VOT ≈ 0 ms (voiced)
/p/ → VOT ≈ 40-60 ms (voiceless)

The perceptual boundary between voiced and voiceless stops is remarkably sharp.
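
The pattern described above can be sketched numerically. The following is a minimal illustration, not a fitted model: it assumes a logistic identification function over VOT, and the boundary (25 ms) and slope (0.5 per ms) are hypothetical values chosen only to make the effect visible. It uses the difference in labelling probability between two stimuli as a rough stand-in for how discriminable they are.

```python
import math

# Illustrative assumptions, not measured values.
BOUNDARY_MS = 25.0    # assumed /b/-/p/ category boundary
SLOPE_PER_MS = 0.5    # assumed steepness of the identification function


def p_voiceless(vot_ms):
    """Probability of labelling a stimulus as /p/ (logistic identification function)."""
    return 1.0 / (1.0 + math.exp(-SLOPE_PER_MS * (vot_ms - BOUNDARY_MS)))


def label_difference(vot_a, vot_b):
    """Rough proxy for discriminability: how differently two stimuli are labelled."""
    return abs(p_voiceless(vot_a) - p_voiceless(vot_b))


# Two pairs with the same 10 ms physical difference.
within_category = label_difference(0.0, 10.0)    # both stimuli labelled /b/
across_boundary = label_difference(20.0, 30.0)   # pair straddles the boundary

print(f"within-category pair: {within_category:.4f}")
print(f"across-boundary pair: {across_boundary:.4f}")
```

Under these assumptions the across-boundary pair is labelled very differently while the within-category pair is labelled almost identically, even though both pairs are separated by the same 10 ms.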

The Lack of Invariance Problem

The same phoneme can have radically different acoustic realizations depending on the surrounding phonemes (coarticulation), the speaker's vocal tract, speaking rate, and emotional state. The /d/ sounds in "deep" and "doom" differ substantially in their acoustic properties because the tongue and lips are already moving toward the following vowel. Yet listeners perceive both as /d/. How the brain achieves this perceptual constancy despite acoustic variability is known as the lack of invariance problem, and it has been called the central challenge of speech perception.

Motor Theory and Direct Realism

Alvin Liberman's motor theory of speech perception proposed that listeners perceive not acoustic patterns but the intended articulatory gestures of the speaker. This offers a solution to the invariance problem: different acoustic signals map to the same perceived phoneme because they correspond to the same intended gesture. The discovery of mirror neurons (neurons that fire both when performing and when observing an action) has been cited as lending some neurological plausibility to this view. The McGurk effect, in which visual information about lip movements alters what listeners hear, shows that speech perception draws on information about articulation beyond the acoustic signal itself.

The McGurk Effect

When an audio recording of "ba" is paired with video of a face saying "ga," most listeners perceive "da" — a fusion of the auditory and visual information. This illusion, discovered by Harry McGurk and John MacDonald (1976), powerfully demonstrates that speech perception is fundamentally multimodal: the visual speech signal (lip movements, facial gestures) is automatically integrated with the auditory signal.

Lexical and Contextual Effects

Top-down knowledge strongly influences speech perception. The phoneme restoration effect (Richard Warren, 1970) shows that when a phoneme is replaced by noise, listeners "hear" the missing phoneme based on lexical and sentence context. The Ganong effect demonstrates that a sound that is acoustically ambiguous between two phonemes tends to be perceived as whichever phoneme makes the surrounding string a real word. These findings show that speech perception is an active inferential process combining acoustic evidence with linguistic knowledge.
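
One way to make "combining acoustic evidence with linguistic knowledge" concrete is a simple Bayesian sketch. This is an illustrative framing, not a model claimed by the studies above: the acoustic likelihoods and lexical priors below are hypothetical numbers, and the task/dask pair is used only as an example of a word versus a nonword.

```python
def posterior(likelihood, prior):
    """Combine acoustic likelihoods with lexical priors and normalize (Bayes' rule)."""
    unnormalized = {ph: likelihood[ph] * prior[ph] for ph in likelihood}
    total = sum(unnormalized.values())
    return {ph: value / total for ph, value in unnormalized.items()}


# Hypothetical lexical prior for the frame "_ask":
# "task" is a real word, "dask" is not, so /t/ is favoured.
lexical_prior = {"d": 0.1, "t": 0.9}

# An ambiguous initial sound: the acoustics support /d/ and /t/ equally.
ambiguous_sound = {"d": 0.5, "t": 0.5}

# A clear /d/: the acoustics strongly support /d/.
clear_sound = {"d": 0.95, "t": 0.05}

ambiguous_percept = posterior(ambiguous_sound, lexical_prior)
clear_percept = posterior(clear_sound, lexical_prior)

print("ambiguous sound:", ambiguous_percept)
print("clear /d/:      ", clear_percept)
```

With these numbers the ambiguous sound is resolved toward /t/, the interpretation that forms a word, while the clear /d/ is still heard mostly as /d/. In this sketch, lexical context matters most when the acoustic evidence is weak.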
