Our voices are not identical; they are our identities. The human voice is a powerful signal that conveys one's age, gender, size, ethnicity, and personality, among other attributes. Yet, until now, users of augmentative and alternative communication (AAC) devices, screen reading technologies, and other text-to-speech (TTS) applications have relied on a limited set of mass-produced, generic-sounding synthetic voices. This mismatch in vocal identity impacts educational outcomes, compromises personal safety, and hinders social integration. Conventional methods for building a synthetic voice require a voice actor to record an extensive dataset of studio-quality recordings, which are used to train a computational model and generate the output voice. The process is time- and labor-intensive and thus inaccessible to everyday consumers, let alone those with speech impairment. VocaliD Inc.'s award-winning technology offers an unprecedented means to build custom-crafted synthetic voices that reflect the recipient by combining his or her own residual vocalizations with recordings of a matched speaker from our Human Voicebank. We have discovered that even a single vowel contains enough vocal DNA to seed the personalization process. VocaliD's custom voice sounds like the recipient in age, personality, and vocal identity but is as clear and understandable as the donor's recordings. To create an affordable and efficient method of voice personalization, we leverage the penetration of high-quality microphones and recording software on consumer-grade computers, along with increased technological literacy, to crowdsource the collection of speech and voice recordings. This enables engagement across broad age, socioeconomic, cultural, and linguistic groups in order to truly sample the diversity of the human voice. The challenges, however, are to ensure high-quality recordings and to sufficiently engage speech donors to complete the recording corpus.
This Phase II project builds upon our success in Phase I in reducing the length of the donor corpus and in streamlining and automating the recipient protocol. Results of our perceptual experiments indicated that while we were able to reduce the length of the donor corpus by 70%, the reduction came at the cost of reduced intelligibility and naturalness. Since voice quality is vital to acceptance and adoption of our voices, this Phase II proposal is aimed at improving the clarity and expressiveness of our voices while maintaining the optimized corpus length. We propose to improve TTS intelligibility by developing methods to mitigate the effects of background noise and reverberation during donor and recipient recordings and to align expected and actual spoken transcripts, reducing errors in TTS model building (Aim 1). To address the issue of TTS naturalness, we propose to modify the donor corpus to include more prosodically diverse contrasts and to adapt the donor protocol to elicit natural melodic intonation and phrasing (Aim 2). These advances will yield a scalable and cost-effective method of personalized voice creation that will humanize speech-enabled technologies for AAC and beyond.
VocaliD's breakthrough technology powers the first-ever custom synthetic voices that are made using only a brief sample of the recipient's residual voice combined with recordings of a matched speaker from a crowdsourced voicebank. This Phase II SBIR proposal addresses the challenge of creating a scalable and affordable method for achieving high-quality, natural-sounding, personalized voices from sparse and 'non-laboratory grade' recipient and speech donor samples.