The human voice, our oldest and most reliable communication tool, is now rapidly becoming the input interface of choice for everyday interaction with technologies such as car navigation systems, medical and legal dictation systems, personal assistants like "Siri," and automated financial systems. Thousands of 'apps' have been developed to help consumers use voice to find the information they are looking for. Speech recognition is the backbone of all of these technologies. As a result, the performance of speech recognizers is key to customer satisfaction. Currently, many systems still need to be tuned to a particular speaker to perform well, and the recognition task has to be limited in other ways, such as requiring (1) use of a specific vocabulary, (2) clear pronunciation of most of the words, especially the content words, and (3) limited background noise. In this research, speech variability will be studied, and methods and models will be developed that will enable recognizers to be more speaker-independent and capable of handling the full range of speech styles, from clear articulation to very casually spoken speech. The results will also bear on linguistic models of speech planning and organization, providing evidence for how speakers trade off efficiency in speech production against the need to be intelligible.
In this project, point-source tracking data for the speech articulators will be collected concurrently with the corresponding acoustics. Speakers will be recorded at both a normal and a rapid pace (the purpose of the latter is to significantly increase the degree of variability in the signal). These data will allow for the investigation of whether speakers always move their speech articulators toward a desired target (e.g., tongue tip to teeth in producing /t/) even when a rapid production pace obscures the relevant acoustic information (as in "perfect"). If confirmed, this finding will point the way toward making recognition systems more robust through the incorporation of articulatory information. In addition, such data will support the development of a speech inversion system capable of 'uncovering' hidden articulatory movements that are potentially masked in the acoustics.
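To make the notion of speech inversion concrete, the following is a minimal sketch of a frame-wise acoustic-to-articulatory regressor. It assumes, purely for illustration, that the acoustic signal has been reduced to 13-dimensional MFCC-like frames and that the point-source tracking yields x/y coordinates for three sensors (e.g., tongue tip, tongue body, lower lip) time-aligned to those frames; synthetic placeholder data stands in for real synchronized recordings, and the model choice (a small multilayer perceptron) is an assumption, not the project's actual system.

```python
"""Minimal sketch of acoustic-to-articulatory inversion (illustrative only).

Assumptions, not the project's actual pipeline:
- acoustic frames are 13-dim MFCC-like vectors
- articulatory targets are x/y coordinates of 3 point sensors,
  time-aligned to the acoustic frames
Real data would come from synchronized articulatory tracking + audio.
"""

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# --- Placeholder data standing in for aligned recordings -------------------
n_frames, n_acoustic, n_articulatory = 5000, 13, 6   # 3 sensors x (x, y)
acoustics = rng.normal(size=(n_frames, n_acoustic))  # stand-in for MFCC frames
# Synthetic articulatory trajectories: a simple function of the acoustics plus
# noise, so the regressor has something learnable to recover.
true_map = rng.normal(size=(n_acoustic, n_articulatory))
articulation = acoustics @ true_map + 0.1 * rng.normal(size=(n_frames, n_articulatory))

# --- Train a frame-wise inversion model ------------------------------------
X_train, X_test, y_train, y_test = train_test_split(
    acoustics, articulation, test_size=0.2, random_state=0
)
inverter = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=500, random_state=0)
inverter.fit(X_train, y_train)

# --- "Uncover" articulator positions from held-out acoustics ----------------
predicted = inverter.predict(X_test)
print("RMSE per articulator dimension:",
      np.sqrt(mean_squared_error(y_test, predicted, multioutput="raw_values")))
```

In a setting like the one described above, the same kind of mapping would be trained on the concurrently recorded acoustic and point-source tracking data, and then applied to acoustics alone to estimate articulator trajectories when the acoustic evidence for a gesture is weak or absent.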