Voice conversion (VC) systems transform segments of speech from a given source speaker so that they can be identified as spoken by a specified target speaker. Currently, standard VC systems require parallel training on extensively labeled sets of speech data in which the source and target speakers share equivalent content, so that direct mapping models can be built. This project builds on the concepts of non-parallel VC systems, reducing the need for labeled and shared speech content between source and target speakers and allowing for both intra-lingual and cross-lingual conversion scenarios. The project focuses on two main areas: (1) building a framework for non-parallel VC without explicit phonetic, sound, word, or sentence level labels, and (2) providing effective target speaker mapping to obtain converted speech of quality as good as or better than that of current VC systems. The VC framework consists of three main components: (1) a speaker-independent language model; (2) an algorithm for adapting the model to the target speaker; and (3) a speech synthesis block that generates converted speech from the target-adapted language model.
This project will provide a broad framework for applications such as personalization of assistive text-to-speech (TTS) systems and foreign language learning, and as a possible component in speech-to-speech translation systems. The project will support graduate student research and provide results for community distribution through conference and journal submissions. Additionally, an open-source software toolset will be developed and freely distributed. The project will also be used in outreach for underrepresented groups in Science, Technology, Engineering, and Mathematics (STEM) disciplines.