It has long been postulated that a human determines the linguistic identity of a sound based on evidence detected at various levels of the speech knowledge hierarchy, from acoustics to pragmatics. Indeed, people do not continuously convert a speech signal into words as an automatic speech recognition (ASR) system attempts to do. Instead, they detect acoustic and auditory cues, weigh and combine them to form cognitive hypotheses, and then validate those hypotheses until consistent decisions are reached. This human model of speech processing suggests a candidate framework for developing next-generation speech technologies with the potential to go beyond current limitations.
To bridge the performance gap between ASR systems and humans, the narrow notion of speech-to-text in ASR has to be expanded to incorporate all related human information "hidden" in speech utterances. Instead of the conventional top-down, network decoding paradigm for ASR, we are establishing a bottom-up, event detection and evidence combination paradigm for speech research to facilitate collaborative Automatic Speech Attribute Transcription (ASAT). The goals of the proposed project are to: (1) develop feature detection and knowledge integration modules to demonstrate ASAT and ASR; (2) build an open source, highly shared, plug-'n'-play ASAT cyberinfrastructure for collaborative research to lower the barriers to entry for ASR research; and (3) provide an objective evaluation methodology to monitor technology advances in individual modules and across the entire system.
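To make the detection-and-combination idea concrete, the following is a minimal sketch of the two-stage, bottom-up flow: attribute detectors produce frame-level evidence scores, and an evidence merger combines them into phone-level hypotheses. The attribute names, attribute-to-phone profiles, and scoring rule here are hypothetical illustrations, not the actual ASAT detectors or integration modules.

```python
# Illustrative sketch (assumed design): per-frame attribute detectors emit
# evidence scores, then an evidence merger combines them into phone hypotheses
# via a simple log-linear match against expected attribute profiles.

import numpy as np

# Hypothetical articulatory attributes and a toy attribute->phone mapping.
ATTRIBUTES = ["voicing", "nasality", "frication"]
PHONE_PROFILES = {
    "m": {"voicing": 1.0, "nasality": 1.0, "frication": 0.0},
    "s": {"voicing": 0.0, "nasality": 0.0, "frication": 1.0},
    "z": {"voicing": 1.0, "nasality": 0.0, "frication": 1.0},
}

def detect_attributes(features: np.ndarray) -> dict:
    """Stand-in attribute detectors: map frame features (n_frames x dim)
    to per-frame posteriors in [0, 1] for each attribute.
    A real system would use trained classifiers here."""
    rng = np.random.default_rng(0)
    n_frames = features.shape[0]
    return {a: rng.random(n_frames) for a in ATTRIBUTES}

def combine_evidence(attr_posteriors: dict) -> dict:
    """Evidence merger: score each phone hypothesis per frame by how well
    the detected attribute posteriors match its expected profile."""
    eps = 1e-6
    n_frames = len(next(iter(attr_posteriors.values())))
    scores = {}
    for phone, profile in PHONE_PROFILES.items():
        log_score = np.zeros(n_frames)
        for attr, target in profile.items():
            p = attr_posteriors[attr]
            # Probability that the frame agrees with the phone's expected value.
            agreement = p if target == 1.0 else 1.0 - p
            log_score += np.log(agreement + eps)
        scores[phone] = log_score
    return scores

if __name__ == "__main__":
    frames = np.zeros((5, 13))                # 5 frames of dummy features
    posteriors = detect_attributes(frames)    # bottom-up detection stage
    evidence = combine_evidence(posteriors)   # knowledge-integration stage
    for t in range(5):
        best = max(evidence, key=lambda ph: evidence[ph][t])
        print(f"frame {t}: best phone hypothesis = {best}")
```

In an actual ASAT system the downstream integration would validate such hypotheses against higher-level knowledge sources (lexical, syntactic, pragmatic) rather than stopping at frame-level phone scores; the sketch only shows the bottom-up direction of the information flow.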