In recent years, there has been a dramatic increase in the deployment of voice-based interfaces for human-machine communication. Such devices typically have multiple microphones (or channels), and as they are used in homes, cars, and other everyday settings, a major technical challenge is how to reliably localize a target speaker and recognize his/her speech in environments with multiple sound sources and room reverberation. The performance of traditional approaches to localization and separation degrades significantly in the presence of interfering sounds and room reverberation. This project investigates multi-channel speaker localization and speech separation from a deep learning perspective. The innovative approach in this project is to train deep neural networks to perform single-channel speech separation in order to identify the time-frequency regions dominated by the target speaker. Such regions, identified across microphone pairs, provide the basis for robust speaker localization and separation.
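As a concrete illustration of this idea, the following is a minimal sketch (not the project's implementation) of single-channel time-frequency masking, assuming STFT analysis and resynthesis and a trained DNN that outputs a ratio mask in [0, 1]; the `dnn_mask` function below is only a hypothetical placeholder standing in for such a network.

```python
import numpy as np
from scipy.signal import stft, istft

def dnn_mask(mag):
    # Hypothetical placeholder: a real system would run a trained DNN on
    # features of the magnitude spectrogram `mag` to predict a ratio mask.
    return np.clip(mag / (mag + np.median(mag) + 1e-12), 0.0, 1.0)

def separate_single_channel(x, fs, nperseg=512):
    _, _, X = stft(x, fs, nperseg=nperseg)
    mask = dnn_mask(np.abs(X))          # per T-F unit weight in [0, 1]
    target_dominated = mask > 0.5       # T-F regions attributed to the target
    _, x_hat = istft(mask * X, fs, nperseg=nperseg)
    return x_hat, target_dominated
```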
Building on this novel perspective, the proposed research seeks to achieve robust speaker localization and speech separation. For robust speaker localization, time-frequency (T-F) masks will be generated by deep neural networks (DNNs) from single-channel noisy speech signals. For each pair of microphones, an integrated mask will be computed from the two corresponding single-channel masks and then used to weight a generalized cross-correlation function, from which the direction of the target speaker will be estimated. An alternative localization method will be based on mask-weighted steered responses. For robust speech separation, masking-based beamforming will be performed first, where T-F masking and accurate speaker localization are expected to substantially enhance beamforming results. To overcome the limitation of spatial filtering in multi-source reverberant conditions, spectral (monaural) and spatial information will be integrated as DNN input features in order to separate only the signal that has speech characteristics and originates from a specific direction. The proposed approach will be evaluated on multi-channel noisy and reverberant datasets recorded in real-world environments, using automatic speech recognition rate as well as localization and separation accuracy. This will ensure broader impact not only in advancing speech processing technology but also, in the long run, in facilitating the design of next-generation hearing aids.
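To make the localization step concrete, the following is a minimal sketch (again, not the proposed system) of mask-weighted GCC-PHAT for a single microphone pair under a far-field assumption. The `estimate_mask` function is a hypothetical stand-in for a DNN mask estimator, and the integrated pairwise mask is taken here, by assumption, as the product of the two single-channel masks.

```python
import numpy as np
from scipy.signal import stft

def estimate_mask(mag):
    # Hypothetical placeholder for a DNN-based T-F mask (crude energy threshold).
    return (mag > np.median(mag)).astype(float)

def mask_weighted_gcc_phat(x1, x2, fs, mic_dist, nperseg=512, c=343.0):
    _, _, X1 = stft(x1, fs, nperseg=nperseg)
    _, _, X2 = stft(x2, fs, nperseg=nperseg)
    # Integrated pairwise mask: product of the two single-channel masks.
    m = estimate_mask(np.abs(X1)) * estimate_mask(np.abs(X2))
    cross = X1 * np.conj(X2)
    phat = cross / (np.abs(cross) + 1e-12)        # PHAT weighting
    gcc_f = np.sum(m * phat, axis=1)              # mask-weighted sum over frames
    cc = np.fft.fftshift(np.fft.irfft(gcc_f))     # back to the lag domain
    max_lag = int(np.ceil(mic_dist / c * fs))
    center = len(cc) // 2
    lags = np.arange(-max_lag, max_lag + 1)
    tau = lags[np.argmax(cc[center - max_lag:center + max_lag + 1])] / fs
    # Convert the estimated TDOA to a direction-of-arrival angle (degrees).
    return np.degrees(np.arcsin(np.clip(tau * c / mic_dist, -1.0, 1.0)))
```

In this sketch, T-F units given low mask values contribute little to the cross-correlation, so the peak of the weighted correlation is determined mainly by target-dominated regions, which is the intuition behind using masks to make localization robust to interference and reverberation.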
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.