In recent decades, synthetic speech has become a ubiquitous and increasingly seamless aspect of human-machine interfaces. Although cars, microwaves, phones, and kiosks all "talk" in human-like ways, the naturalness and personality of these voices fall short of human expression. While this may not matter for many text-to-speech (TTS) applications, over two million Americans with severe speech-motor impairments require assistive communication aids with TTS output. Concatenative TTS synthesizers yield highly intelligible voices, yet many assistive devices rely on small-footprint formant synthesis that sounds robotic and has poor intelligibility. Moreover, the choice of voices on conventional devices is limited and does not reflect the user; it is not uncommon for a child to use the same voice her whole life and for her peers to share that same voice even when using different devices. This lack of attention to the individuality of synthetic voices has consequences for the adoption of assistive technology as an extension of the user, and may adversely affect societal attitudes toward the user group.

In her prior work, the PI began to address these issues by adapting a concatenative synthesizer, constructed from acoustic recordings of a healthy talker, using vocal source characteristics obtained from a target talker with speech impairment. The adapted voice was highly intelligible and conveyed the target user's identity, yet it also retained substantial elements of the healthy talker's identity due to the influence of vocal tract filter characteristics. This suggests that personalized speech synthesis may be more successful using an alternative approach, in which acoustic and articulatory data from healthy talkers are combined with both source and filter characteristics from target talkers to generate an individualized voice. In this project, the PI will develop hybrid statistical parametric synthesis techniques to model vocal tract and source characteristics of impaired talkers, with the goal of generating highly intelligible and personalized synthetic speech. The PI envisages a future where source and filter parameters of a Hidden Markov Model (HMM) based synthesizer can be adapted to model a child user's vocal tract and modified over time to "grow" with the child's maturing vocal system, fostering a stronger personal connection between the user and the communication device.

Broader Impacts: This project strives to make communication accessible and socially fulfilling by designing an enabling technology that blurs the line between system and user. The human voice is not merely a signal; it has an individualized and personal quality that affects how others perceive us and how we interact with those around us. The ultimate goal of this work is to afford users of TTS the same ownership and individuality that the natural voice provides. Project outcomes will have broad impact on both users of assistive aids and able-bodied users of TTS technologies. The research may also lead to a novel means of assessing the nature and articulatory locus of speech impairment, by comparing model parameters to impaired productions. The interdisciplinary nature of this research will promote teaching, training, and learning in computer science and in speech and hearing sciences.

Project Report

We live in a world in which we interact with computers using speech. For individuals who rely on assistive communication technologies, synthetic speech offers a voice. Unfortunately, the voice options are limited and do not reflect the user; it is not uncommon for several children in a classroom or adults in a workplace to use the same voice. Individuals may also use the same voice their whole life despite having grown up and changed devices more than a dozen times. This work addressed this lack of individuation in prosthetic voice technology. The intellectual merit of this work relates to scientific advances in speech production and speech technologies and includes new algorithm development and deployment. To develop personalized voices for individuals with limited vocal output, we began by adapting a concatenative synthesizer built using a database of recordings from a healthy donor and samples of residual vocalizations from the end user. Our empirical findings and recent developments in Hidden Markov Model (HMM) based synthesis suggested that alternative approaches may improve personalized synthesis and also provide unique insights into the nature of a target talker’s speech production deficits. The current grant extended our earlier work by integrating acoustic and articulatory features to 1) extract vocal source characteristics that convey speaker identity, 2) extract vocal tract filter characteristics to further preserve speaker identity, and 3) statistically model natural, intelligible speech using aggregates of acoustic and articulatory dynamics across speakers. We found that modeling acoustic and articulatory similarities across speakers resulted in fluent, natural-sounding synthetic speech. We then made advances toward integrating our personalized voice into existing assistive communication devices. Toward that end, we worked with two different manufacturers on the iOS and Windows platforms to provide a personalized voice to three beta users.
These individuals have reported changes in the quality and quantity of communication exchanges and improvements in overall quality of life related to social integration and access to educational opportunities. Additional broader impacts include novel insights into modeling speech production, new approaches to sparse-sample voice conversion, and increased awareness of speech and voice impairment as well as of the limitations of current assistive communication technologies.

National Science Foundation (NSF)
Division of Information and Intelligent Systems (IIS)
Standard Grant (Standard)
Program Officer
Ephraim P. Glinert
Northeastern University
United States