To investigate the neural substrates of audiovisual speech perception, we conducted a functional magnetic resonance imaging study with 28 normal volunteers. We hypothesized that the constraint provided by visually presented articulatory speech (mouth movements) would lessen the workload for speech identification if the visual and auditory inputs were concordant, but would increase the workload if they were discordant. In auditory attention sessions, subjects were required to identify vowels based on auditory speech. Auditory vowel stimuli were presented with concordant or discordant visible articulation movements, with unrelated lip movements, or without visual input. In visual attention sessions, subjects were required to identify vowels based on the visually presented vowel articulation movements. The movements were presented with concordant or discordant uttered vowels, with noise, or without sound. Irrespective of the attended modality, concordant conditions significantly shortened the reaction time, whereas discordant conditions lengthened it. Within the neural substrates that were activated by both the auditory and the visual tasks, the mid superior temporal sulcus showed greater activity for discordant stimuli than for concordant stimuli. These findings suggest that the mid superior temporal sulcus plays an important role in the auditory-visual integration process underlying vowel identification.