Despite great advances in computer recognition of conventional speech, automatic recognizers have enormous trouble with speech that deviates from typical norms of pitch range, speaking style, and timing. Such unconventional speech includes the "motherese" used to speak to young children, certain kinds of dysarthric speech, and singing. Current speech recognizers draw their power from statistical models of very large collections of real speech, but the corollary of this power is that speech differing from the modeled norm cannot be handled nearly so well.
The goal of this project is to create a speech recognition system able to handle a broad range of non-canonical speaking and voicing styles. As a motivating basis, we target the transcription of singing. Sung speech poses significant challenges with implications for broader speech scenarios: compared with conventional speech, its timing is highly distorted; its pitch level, range, and dynamics are very different; and there are frequently simultaneous sound sources (e.g., accompanying instruments) whose signals must be distinguished from the voice.
The approach is first to make a best-effort separation of the voice, e.g., by closely filtering the predominant pitch in the mixed signal. This candidate voice is then transformed and normalized to resemble conventional speech: the pitch harmonics are interpolated to obtain a more pitch-invariant spectrum, and the time axis is warped to achieve a more uniform rate of spectral change (eliding sustained, unchanging sounds). Finally, a conventional speech recognizer is adapted to recognize this normalized speech. To train the recognizer for the target domain, a substantial collection of music audio will be manually aligned with phoneme-level transcriptions of the singing. The resulting corpus will be made freely available to other researchers in music and non-canonical speech.
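To illustrate the harmonic-interpolation idea, the following minimal sketch samples a signal's magnitude spectrum at multiples of a known predominant pitch and resamples the harmonic amplitudes onto a fixed grid, so that two tones with the same timbre but different pitches map to similar envelopes. This is an assumption-laden illustration, not the project's implementation; the function `harmonic_envelope`, its parameters, and the synthetic test tones are invented here for the example.

```python
import numpy as np

def harmonic_envelope(signal, sr, f0, n_harmonics=20, n_bins=64):
    """Sketch of a pitch-invariant spectral envelope (illustrative only):
    sample the magnitude spectrum at multiples of the predominant pitch
    f0, then interpolate those harmonic amplitudes onto a fixed grid."""
    windowed = signal * np.hanning(len(signal))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(signal), 1.0 / sr)
    # Frequencies of the first n_harmonics partials, below Nyquist.
    harm_freqs = f0 * np.arange(1, n_harmonics + 1)
    harm_freqs = harm_freqs[harm_freqs < sr / 2.0]
    # Read the spectrum magnitude at each harmonic frequency.
    harm_mags = np.interp(harm_freqs, freqs, spectrum)
    # Resample onto a fixed-length grid so the output shape (and, ideally,
    # shape of the envelope) no longer depends on f0.
    grid = np.linspace(harm_freqs[0], harm_freqs[-1], n_bins)
    return np.interp(grid, harm_freqs, harm_mags)

# Two synthetic tones with identical relative harmonic amplitudes
# (a shared "timbre") but different pitches.
sr = 16000
t = np.arange(0, 0.1, 1.0 / sr)

def tone(f0):
    return sum((0.5 ** k) * np.sin(2 * np.pi * k * f0 * t)
               for k in range(1, 6))

env_low = harmonic_envelope(tone(110.0), sr, 110.0)
env_high = harmonic_envelope(tone(220.0), sr, 220.0)
# Despite the octave difference in pitch, the two envelopes should be
# strongly correlated, which is the sense in which the representation
# is "more pitch-invariant".
```

The time-warping step would play an analogous role on the other axis, resampling frames so that spectral change proceeds at a roughly uniform rate.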
This work will develop techniques to make current speech recognition applicable to a much broader range of speech material and speakers.