This EArly Concept Grant for Exploratory Research (EAGER) investigates new machine learning techniques for discovering sub-word units in speech for use in automatic speech recognition (ASR). The representation of words in terms of sub-word units is rarely learned from data, despite significant disagreement among linguists as to the sub-word unit inventory. This project represents exploratory work toward a larger goal of making all aspects of ASR learnable, using scientific insights while being discriminatively trained.
In contrast with prior work, speech segments are clustered into units using discriminatively learned segmental similarities, rather than via dynamic time warping or hidden Markov models. Rather than presupposing phoneme-like units, the project learns multiple heterogeneous unit types. It also leverages multi-modal data (such as video and articulatory measurements) to improve unit discovery by sharing information across modalities. In this exploratory work, the learned units are used in a discriminative model that rescores initial outputs from a standard phone-based recognizer, and the experiments focus on small- and medium-vocabulary recognition.
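The rescoring step described above can be illustrated with a minimal sketch. It assumes the phone-based recognizer emits an n-best list of hypotheses, each with a baseline score, and that the learned units contribute per-hypothesis features that a linear model weights. The function `rescore`, the feature name `unit_match`, and all scores and weights are hypothetical placeholders chosen for illustration; they are not the project's actual model or results.

```python
from typing import Dict, List, Tuple

# Hypothetical representation of one first-pass hypothesis:
# (word sequence, baseline recognizer score, features derived from learned units)
Hypothesis = Tuple[str, float, Dict[str, float]]


def rescore(
    nbest: List[Hypothesis],
    weights: Dict[str, float],
    baseline_weight: float = 1.0,
) -> List[Hypothesis]:
    """Re-rank an n-best list with a linear model over unit-based features."""

    def total(hyp: Hypothesis) -> float:
        _, base_score, feats = hyp
        # Features without a learned weight contribute nothing.
        unit_score = sum(weights.get(name, 0.0) * value for name, value in feats.items())
        return baseline_weight * base_score + unit_score

    # Highest combined score first.
    return sorted(nbest, key=total, reverse=True)


# Illustrative inputs only (made-up scores, not experimental data).
nbest = [
    ("recognize speech", -12.0, {"unit_match": 0.9}),
    ("wreck a nice beach", -11.5, {"unit_match": 0.2}),
]
weights = {"unit_match": 2.0}

best_words = rescore(nbest, weights)[0][0]
```

In this toy setup the baseline score alone prefers the second hypothesis, while the weighted unit feature moves the first hypothesis to the top. In practice the weights would be trained discriminatively rather than set by hand.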
This project explores new ways of discovering the basic units of speech. Beyond improvements to speech recognition, this project has the potential for broad impact on other research areas that involve clustering or sequences with segmental sub-structure (such as text, video, biological data, and financial data). The results may also include new representations for the study of speech in linguistics and speech science. From a societal perspective, making speech recognition more learnable will, in the long term, make it easier to port the technology to under-served linguistic communities, which do not have the benefit of large data sets or other resources.