Video-based Speech Enhancement for Persons with Hearing and Vision Loss Project Summary It is estimated that by 2030, the number of people in the United States over the age of 65 will account for over 20% of the total population. Hearing and vision loss naturally accompanies the aging process. Persons with hearing loss can benefit from observing the visual cues from a speaker such as the shape of the lips and facial expression to greatly improve their ability to comprehend speech. However, persons with vision loss cannot make use of these visual cues, and have a harder time understanding speech, especially in noisy environments. Furthermore, people with normal vision can use visual information to identify a speaker in a group, which allows them to focus on this person. This can greatly benefit a person with hearing loss who may be using a device such as a sound amplifier or a hearing aid. A user with vision loss, however, needs to be provided with this speaker information to make optimal use of such devices. We propose developing a prototype device that will clean the speech signal from a target speaker and improve speech comprehension for persons with hearing and vision loss in everyday situations. In order to accomplish this task, we need to harness the visual cues that have so far largely been ignored in the design of assistive technolo- gies for persons with hearing loss.
Our first aim i s to learn speaker-independent visual cues that are associated with the target speech signal, and use these audio-visual cues to design speech enhancement algorithms that perform much better in noisy everyday environment than current methods which only utilize the audio signal. We will utilize a video camera and computer vision methods to design advanced digital signal processing techniques to enhance the target speech signals recorded through a microphone.
Our second aim i s to use the video and audio signals to detect and efficiently localize the visible speaker. The information regarding the location of the speaker of interest can then be used to efficiently perform speaker separation, as well as be provided to the user. Finally, we aim to implement these developed algorithms on a portable prototype system. We will test the performance of this system and improve the user-interface through user experiments in real-world situations as well as laboratory conditions. The end product will show the feasibility and importance of incorporating multiple modalities into sensory assistive devices, and set the stage for future research and development efforts.

Public Health Relevance

It is estimated that by 2030, more than one in five people in the United States will be over the age of 65. Age- related hearing and vision loss is considered a natural consequence of the aging process, yet current assistive technology approaches do little to address this type of sensory loss. The proposed research will test the feasibility of incorporating visual information in hearing aids, which is expected to improve speech perception for persons with hearing and vision loss in everyday situations, greatly enhancing their ability to lead independent lives, remain employable, and maintain active participation in society.

National Institute of Health (NIH)
National Eye Institute (NEI)
Exploratory/Developmental Grants (R21)
Project #
Application #
Study Section
Special Emphasis Panel (BNVT)
Program Officer
Wiggs, Cheri
Project Start
Project End
Budget Start
Budget End
Support Year
Fiscal Year
Total Cost
Indirect Cost
Smith-Kettlewell Eye Research Institute
San Francisco
United States
Zip Code