Environmental noise is one of the largest problems for users of voice technologies such as hearing aids, mobile phones, and automatic speech recognition. Current approaches to source separation and speech enhancement typically attempt to modify the noisy signal in order to make it more like the original, leading to distortions in the target speech and residual noise. In contrast, this project takes the innovative approach of driving a speech synthesizer with information extracted from the noisy signal to create a brand new, high-quality, noise-free version of the original sentence. Improvements in noise suppression and speech quality from this approach are expected to have important broader impacts for both the 36 million Americans who are hearing impaired and the 200 million Americans who use smartphones. The project is also being incorporated into the curriculum at a diverse urban college and into established outreach programs to nearby high schools with the goal of encouraging members of under-represented groups to pursue careers in science and engineering.
This project aims to produce a high-quality speech resynthesis system by modifying a concatenative speech synthesizer to use a unit-selection function based on a novel deep neural network (DNN) architecture. Preliminary results have shown this approach to work well for a small-vocabulary, speaker-dependent task, and the current project expands this to the large-vocabulary, speaker-dependent setting in three ways. First, it seeks to improve the intelligibility of the synthesized speech by utilizing perceptually motivated input features, more flexible training signals, and traditional speech enhancement. Second, it seeks to improve the system's scalability by training DNNs to embed noisy and clean speech into a joint low-dimensional space in which similarity can be efficiently computed. Third, it seeks to improve the quality of the synthesized speech by incorporating sequential models of speech units based on acoustic, phonetic, and linguistic compatibility. The use of speech synthesis models in speech enhancement is a departure from traditional approaches and has the potential to make a transformative impact on the quality of enhanced speech.
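As a rough illustration of how unit selection in a joint embedding space might be combined with a sequential model, the following minimal Python sketch assumes that noisy frames and clean dictionary units have already been embedded by some trained networks (not shown). The function names `target_costs` and `select_units`, the squared Euclidean distance, and the `join` cost matrix are hypothetical choices made for this sketch, not the project's actual design. A standard Viterbi search then picks the unit sequence that balances similarity to the noisy input against compatibility between consecutive units.

```python
import numpy as np

def target_costs(noisy_emb, unit_emb):
    """Squared Euclidean distance between every noisy-frame embedding
    (shape T x D) and every clean-unit embedding (shape U x D).
    The distance measure is an assumption of this sketch."""
    diff = noisy_emb[:, None, :] - unit_emb[None, :, :]  # (T, U, D)
    return np.sum(diff ** 2, axis=-1)                    # (T, U)

def select_units(target, join):
    """Viterbi search over clean units.
    target[t, u]: cost of using unit u at position t.
    join[i, j]:   hypothetical cost of concatenating unit j after unit i.
    Returns the unit index sequence with the lowest total cost."""
    T, U = target.shape
    best = target[0].copy()              # best path cost ending in each unit
    back = np.zeros((T, U), dtype=int)   # backpointers
    for t in range(1, T):
        total = best[:, None] + join     # total[i, j]: come from i, go to j
        back[t] = np.argmin(total, axis=0)
        best = total[back[t], np.arange(U)] + target[t]
    path = [int(np.argmin(best))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example: three clean units and three noisy frames in a 2-D space.
unit_emb = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
noisy_emb = np.array([[0.1, 0.0], [0.9, 0.1], [0.1, 0.8]])
costs = target_costs(noisy_emb, unit_emb)

free_path = select_units(costs, np.zeros((3, 3)))            # no join cost
sticky_path = select_units(costs, 10.0 * (1.0 - np.eye(3)))  # switching is expensive
print(free_path, sticky_path)
```

With no join cost the search reduces to a per-frame nearest-neighbor lookup, while a large penalty for switching units forces a single unit to be reused. A practical system would sit between these extremes, with join costs reflecting acoustic, phonetic, and linguistic compatibility.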