Our voices are not identical; they are our identities. The human voice is a powerful signal that conveys one's age, gender, size, ethnicity, and personality, among other attributes. Yet, until now, users of augmentative and alternative communication (AAC) devices, screen-reading technologies, and other text-to-speech (TTS) applications have relied on a limited set of mass-produced, generic-sounding synthetic voices. This mismatch in vocal identity impacts educational outcomes, infringes on personal safety, and hinders social integration. Conventional methods for building a synthetic voice require a voice actor to record an extensive dataset of studio-quality recordings, which are then used to train a computational model that generates the output voice. The process is time- and labor-intensive and thus inaccessible to everyday consumers, let alone those with speech impairments. VocaliD, Inc.'s award-winning technology offers an unprecedented means to build custom-crafted synthetic voices that reflect the recipient by combining his or her own residual vocalizations with recordings of a matched speaker from our crowdsourced Human Voicebank. We have discovered that even a single vowel contains enough vocal DNA to seed the personalization process. VocaliD's custom voices sound like the recipient in age, personality, and vocal identity, and have the clarity of everyday talkers. Having made significant progress under Phase II in improving intelligibility and naturalness, we now produce custom voices that are within a few percentage points of natural human speech in intelligibility and are rated as highly natural sounding by unfamiliar listeners.

However, several persistent issues limit the commercial potential of our current methods. First, our new methods are computationally intensive and thus cannot run on current assistive communication devices; optimizing these methods to reduce latency and thereby improve usability is critical (Aim 1). Second, an unintended consequence of the advances in clarity and naturalness of our voices is the potential for misappropriation. To counteract this, we propose developing a multi-speaker model that creates unique new voices and masks the identity of any given speech donor (Aim 2). Last, although the new models are capable of greater prosodic variation, current methods rely heavily on exemplars in the training data, and our customers indicate a need and desire for greater control of subtle yet meaningful differences in prosody (Aim 3). These additional tasks will further bolster the product and the likelihood of commercial success in the AAC market and beyond.
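As a purely illustrative sketch of the Aim 2 idea, and not a description of VocaliD's actual system, a multi-speaker neural TTS model typically conditions synthesis on a speaker embedding; one simple way to produce a new identity that masks individual donors is to synthesize from a weighted blend of several donor embeddings together with a recipient-derived seed embedding. The function name, embedding dimensionality, and weights below are hypothetical assumptions made only for illustration.

# Illustrative sketch (not VocaliD's method): blending speaker embeddings from
# several voicebank donors with a recipient "seed" embedding so the resulting
# voice is unique and no single donor is directly identifiable.
import numpy as np

def blend_speaker_embeddings(donor_embeddings, recipient_seed, donor_weights, seed_weight=0.5):
    """Return a convex combination of donor embeddings and the recipient seed.

    donor_embeddings : (n_donors, d) array of donor speaker embeddings (hypothetical)
    recipient_seed   : (d,) embedding derived from brief recipient vocalizations
    donor_weights    : (n_donors,) non-negative mixing weights over donors
    seed_weight      : fraction of the final identity taken from the recipient
    """
    donor_embeddings = np.asarray(donor_embeddings, dtype=float)
    donor_weights = np.asarray(donor_weights, dtype=float)
    donor_weights = donor_weights / donor_weights.sum()   # normalize the donor mix
    donor_mix = donor_weights @ donor_embeddings           # (d,) weighted average of donor identities
    recipient_seed = np.asarray(recipient_seed, dtype=float)
    return seed_weight * recipient_seed + (1.0 - seed_weight) * donor_mix

# Example with stand-in data: three donors, 256-dimensional embeddings (dimensions are arbitrary).
rng = np.random.default_rng(0)
donors = rng.normal(size=(3, 256))   # placeholder embeddings for three voicebank donors
seed = rng.normal(size=256)          # placeholder embedding from brief recipient vocalizations
new_voice_embedding = blend_speaker_embeddings(donors, seed, donor_weights=[0.5, 0.3, 0.2])

In practice the blended embedding would condition the synthesis model, so the output voice reflects the recipient's seed while diluting any single donor's contribution.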

Public Health Relevance

VocaliD's breakthrough technology powers the first-ever custom synthetic voices that are made using only brief samples of recipient vocalizations combined with recordings of matched speaker(s) from our crowdsourced voicebank. This Phase II SBIR Administrative Supplement proposal addresses the challenges of creating a scalable and efficient method for achieving high-quality, natural-sounding, and controllable personalized voices.

Agency: National Institutes of Health (NIH)
Institute: National Institute on Deafness and Other Communication Disorders (NIDCD)
Type: Small Business Innovation Research Grants (SBIR) - Phase II (R44)
Project #: 3R44DC014607-03S1
Application #: 9966587
Program Officer: Shekim, Lana O
Project Start: 2015-06-01
Project End: 2020-06-30
Budget Start: 2019-09-02
Budget End: 2020-06-30
Support Year: 3
Fiscal Year: 2019
Name: VocaliD, Inc.
DUNS #: 079399198
City: Belmont
State: MA
Country: United States
Zip Code: 02478