The ability to transform a "foreign" accented voice into its "native" counterpart could be an invaluable tool in pronunciation training for second-language learners. This requires separating those aspects of the speech signal that are determined by the anatomy of the vocal tract from those that result from the idiosyncratic way in which the speaker controls it. While these two sources interact in complex ways in the acoustic domain, a few studies indicate that they may be decoupled in the articulatory space, specifically in the vocal tract frontal cavity.
The objective of this research is to determine the extent to which foreign-accent conversion can be performed through articulatory inversion of the frontal cavity. For this purpose, two complementary problems are being investigated. First, existing articulatory datasets are being used to develop a foreign-accent conversion model that operates in the frontal cavity domain. Second, articulatory inversion models are being developed to estimate the frontal cavity configuration from speech acoustics. Results from these models are being systematically validated through perceptual tests of foreign-accentedness, speaker identity, and acoustic quality.
English is a second language for a significant percentage of the workforce in the United States. Reduction of foreign accent becomes increasingly difficult beyond the "critical period" of language learning, but substantial improvements in pronunciation do occur for adult second-language learners. This work will stimulate the development of new technology to facilitate such improvements. Its results may also find application for film dubbing/looping, as well as in speech technology at large (e.g., feature extraction, data compression).
Despite years or decades of immersion in a new culture, older learners of a second language (L2) typically speak with a so-called "foreign accent." Although a non-native accent does not necessarily limit intelligibility, L2 speakers can be subjected to discriminatory attitudes and negative stereotypes. Thus, by improving their pronunciation, adult L2 learners stand to gain more than mere intelligibility. Prior research has shown that L2 learners can benefit from imitating a native (L1) speaker with a voice similar to their own. However, finding such a "golden speaker" for each learner is impractical. The goal of this grant was to develop signal processing techniques to generate the ideal "golden speaker" for each L2 learner: their own voice, but with a native accent.

Our approach is illustrated in Figure 1. In a first step (A), we build an articulatory synthesizer for the L2 learner: an algorithm that transforms the L2 speaker’s articulatory gestures, such as tongue and lip movements, into audio. In a second step (B), we drive the synthesizer with articulatory gestures from an L1 speaker. The resulting speech audio has the voice quality of the L2 speaker but the linguistic content (and thus the native accent) of the L1 speaker.

A critical step in this process is the articulatory synthesizer. Throughout the project we developed and evaluated three types of synthesizer: concatenative, statistical, and neural. In concatenative synthesis, we collect a large database of short speech segments from the L2 speaker, each segment containing an articulatory gesture and the corresponding speech audio. Given an L1 utterance, we divide it into short segments, and for each segment we search the database for an L2 segment with similar gestures. In a final step, we concatenate the individual L2 segments to produce speech. Unfortunately, unless the database is large (hours of speech), the concatenated speech has noticeable discontinuities at the boundaries between segments. The statistical synthesizer avoids this problem by building a continuous function from the L2 database using machine-learning algorithms. To generate audio, one simply provides the statistical synthesizer with sequences of gestures: if we use L2 gestures, the result is L2 speech with a non-native accent; if we use L1 gestures, the result is L2 speech with a native accent. The neural synthesizer operates in a similar fashion to the statistical synthesizer, except that it uses different machine-learning algorithms to build the continuous function.
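The sketch below illustrates this gesture-to-audio mapping in simplified form. It is not the project's actual implementation: the regressor, feature dimensions, and placeholder arrays are assumptions made for illustration, and a separate vocoder (not shown) would be needed to turn predicted spectral frames into a waveform.

```python
# Illustrative sketch only: a stand-in for the statistical/neural gesture-to-audio mapping.
# Assumed inputs (not from the project): frame-aligned articulatory features (e.g., x/y
# coordinates of tongue, lip, and jaw sensors) and spectral features, already extracted.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Placeholder arrays standing in for real recordings; rows are analysis frames.
l2_gestures = rng.normal(size=(5000, 12))   # L2 speaker: 6 sensors x (x, y)
l2_spectra  = rng.normal(size=(5000, 25))   # L2 speaker: 25 spectral coefficients per frame
l1_gestures = rng.normal(size=(300, 12))    # L1 speaker: gestures for one utterance

# Step A: learn a continuous mapping from the L2 speaker's gestures to their own acoustics.
synthesizer = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=100, random_state=0)
synthesizer.fit(l2_gestures, l2_spectra)

# Step B: drive the L2 synthesizer with L1 gestures; the predicted frames keep the L2
# speaker's voice quality but follow the L1 speaker's articulation (native accent).
converted_spectra = synthesizer.predict(l1_gestures)
print(converted_spectra.shape)              # (300, 25); a vocoder (not shown) yields audio
```

A concatenative synthesizer would instead search the same database for the stored L2 segment whose gestures are closest to each L1 segment and splice the corresponding audio together.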
The result of this accent-conversion process is speech that has never been produced, so it cannot be compared against any reference speech signal: the L2 learner can only produce foreign-accented speech. Instead, the quality of the accent conversion has to be assessed by asking human listeners to rate it. Thus, a second critical component of this project was to develop suitable listening tests. Throughout the project we developed listening tests for four subjective measures: accentedness, acoustic quality, speaker identity, and intelligibility. To measure accentedness, we ask participants to listen to two utterances (one containing the accent conversion, the other containing either L1 or L2 speech), then select the one that is more native-accented. To measure acoustic quality, we ask participants to listen to one utterance and then rate its quality on a 5-point scale (1: bad quality; 5: excellent quality). To measure speaker identity, we ask participants to listen to two utterances, then decide whether they are from the same speaker or from two different speakers; to avoid interference between accent and identity perceptions, utterances in this test are played backwards in time. Finally, to measure intelligibility, we ask participants to listen to an utterance and then transcribe it; intelligibility is then measured as the proportion of words in the utterance that were correctly transcribed.

Using this battery of tests, listeners rate accent conversions as having a more native accent than the original L2 speech and the same voice quality as the L2 speaker. Listeners also find accent conversions more intelligible than L2 speech. However, the acoustic quality of the accent conversions is significantly lower than that of the original speech, because the articulatory gestures used in this project (three points on the tongue, two on the lips, and one on the jaw) capture only a small portion of all the movements in the vocal tract that are responsible for speech. Other articulatory measurement techniques, such as real-time magnetic resonance imaging, may be used in the future to capture more detailed information about the speech articulators. Future work will also evaluate the benefit of accent conversion in pronunciation training settings.
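As a concrete illustration of the intelligibility measure described above, the following sketch computes the proportion of reference words that appear in a listener's transcript. The project's exact scoring rules (e.g., treatment of insertions, misspellings, or word order) are not specified here, so this version simply counts position-independent exact word matches.

```python
# Minimal sketch of the intelligibility score: proportion of reference words that the
# listener transcribed. Scoring details are assumptions, not the project's exact protocol.
from collections import Counter

def intelligibility(reference: str, transcript: str) -> float:
    ref_words = reference.lower().split()
    hyp_counts = Counter(transcript.lower().split())
    correct = 0
    for word in ref_words:
        if hyp_counts[word] > 0:       # this reference word was transcribed at least once
            hyp_counts[word] -= 1
            correct += 1
    return correct / len(ref_words) if ref_words else 0.0

# Example: one of six words ("went") was misheard, giving a score of 5/6.
print(intelligibility("the boy went to the store", "the boy want to the store"))  # 0.833...
```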