Speech technology is supposed to be available for everyone, but in reality, it is not. There are 7000 languages spoken in the world, but speech technology (speech-to-text recognition and text-to-speech synthesis) only works in a few hundred of them. This project will solve that problem, by automatically figuring out the set of phonemes for each new language, that is, the set of speech sounds that define differences between words (for example, "peek" versus "peck:" long-E and short-E are distinct phonemes in English). Phonemes are the link between speaking and writing. A neural net that converts speech into text using some kind of phoneme inventory, and then back again, can be said to have used the correct phoneme inventory if its resynthesized speech always has the same meaning as the speech it started with. This approach can even be tested in languages that don't have any standard written form, because the text doesn't have to be real text: it could be chat alphabet (the kind of pseudo-Roman-alphabet that speakers of Arabic and Hindi sometimes use on twitter), or it could even be a picture (showing, in an image, what the user was describing). This research will make it possible for people to talk to their artificial intelligence systems (smart speakers, smart phones, smart cars, etc.) using their native languages. This research will advance science by providing big-data tools that scientists can use to study languages that do not have a (standard) writing system.
End-to-end neural network methods can be used to develop speech-to-text-to-speech (S2T2S) and other spoken language processing applications with little additional software infrastructure, and little background knowledge. In fact, toolkits provide recipes so that a researcher with no prior speech experience can train an end-to-end neural system after only a few hours of data preparation. End-to-end systems are only practical, however, for languages with thousands of hours of transcribed data. For under-resourced languages (languages with very little transcribed speech) cross-language adaptation is necessary; for unwritten languages (those lacking any standard and well-known orthographic convention), it is necessary to define a spoken language task that doesn't require writing before one can even attempt cross-language adaptation. Preliminary evidence suggests that both types of cross-language adaptation are performed more accurately if the system has available, or creates, a phoneme inventory for the under-resourced language, and leverages the phoneme inventory to facilitate adaptation. The aim of this project is to automatically infer the acoustic phoneme inventory for under-resourced and unwritten languages in order to maximize the speech technology quality of an end-to-end neural system adapted into that language. The research team has demonstrated that it is possible to visualize sub-categorical distinctions between sounds as a neural net adapts to a new phoneme category; proposed experiments 1 and 2 leverage visualizations of this type, along with other methods of phoneme inventory validation, to improve cross-language adaptation. Experiments 3 and 4 go one step further, by adapting to languages without orthography; for a speech technology system to be trained and used in a language without orthography, it must first learn a useful phoneme inventory. Innovations in this project that occur nowhere else include: (1) the use of articulatory feature transcription as a multi-task training criterion for an end-to-end neural system that seeks to learn the phoneme set of a new language, (2) the use of visualization error rate as a training criterion in multi-task learning -- this training criterion is based on a method recently developed to visualize the adaptation of phoneme categories in a neural network, (3) the application of cross-language adaptation to improve the error rates of image2speech applications in a language without orthography, (4) the use of non-standard orthography (chat alphabet) to transcribe speech in an unwritten language, and (5) the use of non-native transcription (mismatched crowdsourcing) to jump-start the speech2chat training task. The methods proposed here will facilitate the scientific study of language, for example, by helping phoneticians to document the phoneme inventories of undocumented languages, thereby expediting the study of currently undocumented endangered languages before they disappear. Conversely, in minority languages with active but shrinking native speaker populations, planned methods will help develop end-to-end neural training methods with which the native speakers can easily develop new speech applications. All planned software will be packaged as recipes for the speech recognition virtual kitchen, permitting high school students and undergraduates with no speech expertise to develop systems for their own languages, and encouraging their interest in speech.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.