Environmental noise is one of the largest problems for users of voice technologies, such as hearing aids, mobile phones, and automatic speech recognition. Current approaches to source separation and speech enhancement typically attempt to modify the noisy signal to make it more like the original, which leads to distortions in the target speech and to residual noise. In contrast, this project uses the innovative approach of driving a speech synthesizer with information extracted from the noisy signal to create a brand-new, high-quality, noise-free version of the original sentence. Improvements in noise suppression and speech quality from this approach are expected to have important broader impacts for both the 36 million Americans who are hearing impaired and the 200 million Americans who use smartphones. The project is also being incorporated into the curriculum at a diverse urban college and into established outreach programs to nearby high schools, with the goal of encouraging members of under-represented groups to pursue careers in science and engineering.

This project aims to produce a high-quality speech resynthesis system by modifying a concatenative speech synthesizer to use a unit-selection function based on a novel deep neural network (DNN) architecture. Preliminary results have shown this approach to work well for a small-vocabulary, speaker-dependent task, and the current project expands this to the large-vocabulary, speaker-dependent setting in three ways. First, it seeks to improve the intelligibility of the synthesized speech by utilizing perceptually motivated input features, more flexible training signals, and traditional speech enhancement. Second, it seeks to improve the system's scalability by training DNNs to embed noisy and clean speech into a joint low-dimensional space in which similarity can be efficiently computed. Third, it seeks to improve the quality of the synthesized speech by incorporating sequential models of speech units based on acoustic, phonetic, and linguistic compatibility. The use of speech synthesis models in speech enhancement is a departure from traditional approaches and has the potential to make a transformative impact on the quality of enhanced speech.
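To make the unit-selection idea concrete, the following is a minimal sketch, not the project's actual system. It assumes that noisy segments and clean dictionary units have already been embedded into a shared space (here they are plain NumPy arrays standing in for DNN outputs), uses squared Euclidean distance as a stand-in for the learned similarity, and uses a simple Viterbi search with a pairwise join cost as a stand-in for the sequential model of unit compatibility. The names `select_units`, `join_cost`, and `join_weight` are hypothetical.

```python
import numpy as np

def select_units(noisy_emb, unit_emb, join_cost, join_weight=1.0):
    """Pick one clean unit per noisy segment with a Viterbi search.

    noisy_emb: (T, D) embeddings of noisy segments (assumed DNN outputs)
    unit_emb:  (N, D) embeddings of clean dictionary units (assumed)
    join_cost: (N, N) cost of unit j directly following unit i (assumed)
    """
    # Target cost: squared Euclidean distance in the joint space, shape (T, N).
    diff = noisy_emb[:, None, :] - unit_emb[None, :, :]
    target = np.sum(diff ** 2, axis=-1)

    T, N = target.shape
    best = target[0].copy()              # best path cost ending in each unit
    back = np.zeros((T, N), dtype=int)   # backpointers for path recovery
    for t in range(1, T):
        # total[i, j]: best cost ending in unit i at t-1, then joining to j.
        total = best[:, None] + join_weight * join_cost
        back[t] = np.argmin(total, axis=0)
        best = total[back[t], np.arange(N)] + target[t]

    # Trace the lowest-cost path backwards from the final segment.
    path = [int(np.argmin(best))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

With a join cost of zero, the search reduces to an independent nearest-neighbor lookup per segment; raising `join_weight` trades match accuracy for smoother sequences of units. The dense distance computation here is only illustrative and would not scale to a large unit dictionary without an approximate nearest-neighbor index.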

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 1618061
Program Officer: Tatiana Korelsky
Project Start:
Project End:
Budget Start: 2016-06-15
Budget End: 2021-05-31
Support Year:
Fiscal Year: 2016
Total Cost: $457,958
Indirect Cost:
Name: CUNY Brooklyn College
Department:
Type:
DUNS #:
City: Brooklyn
State: NY
Country: United States
Zip Code: 11210