This Small Business Innovation Research (SBIR) Phase II project proposes to develop techniques for hands-free text input on mobile devices. Specifically, the project extends the results of the Phase I effort to produce a speech-recognition system for mobile devices and personal appliances that is robust in the presence of background noise. To increase speech-recognition accuracy, four techniques are employed: 1) Spellation, in which users speak a word and partially spell it as they dictate; 2) VoiceTap, in which, for each character, the user says that character followed by the next character in the alphabet; 3) VoicePredict, in which the user says a word and enters its first character using the keyboard or VoiceTap; and 4) multimodal speech-to-text, in which the user speaks and uses the keyboard simultaneously. The research effort will focus on developing modules that allow speech to be dictated using a combination of whole words and spelled words.
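As a rough illustration of the multimodal idea, the sketch below shows how a single typed letter can disambiguate acoustically confusable recognizer hypotheses. This is a minimal sketch under assumed interfaces: the lexicon, the scores, and the function name are hypothetical and are not taken from the project's implementation.

```python
# Illustrative sketch only: pruning a recognizer's word hypotheses with a
# typed first-letter constraint, in the spirit of the VoicePredict scheme
# described above. All names and numbers here are hypothetical.

def voicepredict_rank(hypotheses, typed_prefix, unigram_freq):
    """Rank recognizer hypotheses that match the letters typed so far.

    hypotheses   -- list of (word, acoustic_score) pairs from a recognizer
    typed_prefix -- characters the user has entered (e.g. the first letter)
    unigram_freq -- word -> relative frequency from a unigram language model
    """
    candidates = [
        (word, score * unigram_freq.get(word, 1e-6))
        for word, score in hypotheses
        if word.startswith(typed_prefix)
    ]
    # Highest combined acoustic * language-model score first.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)

# Toy example: the recognizer confuses acoustically similar words, but a
# single typed letter resolves the ambiguity.
hypotheses = [("bat", 0.40), ("pat", 0.35), ("mat", 0.25)]
unigram_freq = {"bat": 0.002, "pat": 0.001, "mat": 0.003}
print(voicepredict_rank(hypotheses, "p", unigram_freq))  # [('pat', ...)]
```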

The outcome of the proposed research has significant commercial potential. Because the front end, or client side, can be ported to a variety of operating systems and processors, this flexibility should enable wide licensing of the technology to telecommunication device manufacturers. The mobile wireless industry is large and growing, and multimodal input technology is important to mobile customers who demand more efficient and accurate methods of communication. The resulting improvements in accuracy could be substantial and would have widespread applicability.

Project Report

VOICEPREDICT: A MULTIMODAL TEXT INPUT METHOD THAT COMBINES VOICE AND TOUCH

Background

Facilitating text input into computers and handheld devices is a work in progress. Well-known solutions include mobile triple-tapping, ambiguous/unambiguous text prediction, mini-QWERTY keyboards, on-screen soft-key displays, and handwriting/gesture recognition. In theory, speech-to-text would be a natural alternative: one could simply speak into a computer or device and have the text appear on the screen. Unfortunately, speech-to-text has historically been plagued with problems, including unbounded language perplexity, background and channel noise, varied pronunciations, unacceptable speaker-training methods, and a lack of intuitive error correction. As a result, speech-to-text remains a futuristic technology, except in specialized applications where the lexicon is tightly constrained, as in call-center automation.

Multimodal Technology Underlying VoicePredict

TravellingWave has taken an innovative approach that combines redundant information from multiple modes, namely the keyboard and the microphone, to significantly enhance the accuracy of both voice recognition and text prediction. Specifically, Voice Powered Text Prediction, or VoicePredict, technology predicts words using the speech rendered by a user in addition to the letters the user inputs; traditional predictive text input systems rely only on letters, while speech-to-text systems rely on speech alone.

VoicePredict System Components

The components that form the basis of the VoicePredict system are briefly described below.

Frequency Localized Temporal Processing

At the very front end of VoicePredict lies TravellingWave's proprietary signal-processing module, the RAGs algorithm (Rao-Aronov-Garafutdinov speech processing algorithm). It is based on published research on compact features modeling the traveling-wave phenomena in the human cochlea. RAGs extracts modulation information from speech, as opposed to traditional power-spectrum analysis. For example, instead of relying on the spectral energy envelope, RAGs computes the locations of several resonances (similar to speech formants); their slowly varying temporal trajectories; rich modulations coding the harmonics around those resonances; local bandwidths; syllable onset and offset times; durations of phones; and other proprietary acoustic-phonetic features. The RAGs output is then employed by the acoustic-phonetic models to make decisions about word prediction. Overall, this enables the VoicePredict system to perform reliably in a variety of noisy environments.

Acoustic Modeling

VoicePredict combines traditional acoustic-modeling techniques (based on modeling phonemes using statistical models) with acoustic-phonetic modeling. In VoicePredict, the latter incorporates spectrally localized temporal features in conjunction with features such as phonetic durations, syllable boundaries, and formant energies.

Language Modeling

VoicePredict adapts its language model based on the frequency of word usage. New words that are not in the large built-in dictionary (tens of thousands of words) are learned on the fly. Currently, VoicePredict employs unigram language models, meaning it does not rely on sentence context. The unigram modeling techniques take advantage of VoicePredict's inherent multimodality, resulting in an extremely robust language model.
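The RAGs algorithm itself is proprietary and its details are not given in this report, but the sketch below illustrates the general idea behind frequency-localized temporal processing: split the signal into narrow frequency bands and track each band's slowly varying envelope, rather than relying on a single short-time power spectrum. The band edges, the 60 Hz envelope cutoff, and all names are illustrative assumptions, not the RAGs design.

```python
# Minimal sketch of frequency-localized temporal (modulation) analysis.
# Each bandpass channel is a crude stand-in for a cochlear frequency
# region; its smoothed envelope is one slow temporal trajectory.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def band_envelopes(signal, fs, bands):
    """Return one low-pass-filtered amplitude envelope per frequency band."""
    envelopes = []
    for lo, hi in bands:
        # Isolate one frequency-local region of the signal.
        sos_bp = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos_bp, signal)
        # Rectify, then smooth to keep only the slow temporal modulations.
        sos_lp = butter(2, 60.0, btype="lowpass", fs=fs, output="sos")
        envelopes.append(sosfiltfilt(sos_lp, np.abs(band)))
    return np.stack(envelopes)

# Toy input: two carriers with different amplitude-modulation rates.
fs = 16000
t = np.arange(fs) / fs
x = (1 + np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 500 * t) \
  + (1 + np.sin(2 * np.pi * 8 * t)) * np.sin(2 * np.pi * 2000 * t)
env = band_envelopes(x, fs, bands=[(300, 700), (1500, 2500)])
print(env.shape)  # (2, 16000): one modulation trajectory per band
```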
In this project, the Frequency Localized Temporal processing module was first developed by the TravellingWave researchers and integrated with the overall VoicePredict system. A software development kit (SDK) was subsequently designed and developed for the overall system. This SDK was then integrated into a simulator by researchers at the Human-Computer Interface laboratory at Carnegie Mellon University. The objective was to study the effectiveness of the novel multimodal interface, VoicePredict, compared with a regular virtual keyboard and a 9-digit keypad in writing out a set of phrases.

OBSERVATIONS

Using VoicePredict's Speak & Type, input was 30% faster than with the keyboard, and the average number of key strokes needed to complete sentences decreased by more than 80%. These results demonstrate that it is significantly faster and easier to enter text accurately using VoicePredict.

SUMMARY

This SBIR Phase II project has been very successful in developing a novel multimodal text input solution with significant potential to revolutionize human-machine interfaces in general. The technology is ready, in the form of an SDK, for commercialization. In-house and external benchmarking experiments show that VoicePredict is the fastest of the input methods tested while requiring the fewest key presses. The company has already launched one product in a mobile marketplace and is currently in discussions to volume-license VoicePredict to device manufacturers and mobile operators.
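For intuition about the keystroke figure, the back-of-the-envelope sketch below assumes, hypothetically, that Speak & Type needs roughly one disambiguating key press per spoken word, versus one press per character (including spaces) on a full keyboard; the report does not specify this model, and the phrase is arbitrary.

```python
# Purely illustrative keystroke-savings estimate; the one-key-per-word
# assumption is ours, not a measurement from the CMU study.
def keystroke_savings(phrase):
    full_keyboard = len(phrase)      # one press per character, incl. spaces
    speak_and_type = len(phrase.split())  # ~one key press per spoken word
    return 1 - speak_and_type / full_keyboard

phrase = "the quick brown fox jumps over the lazy dog"
print(f"{keystroke_savings(phrase):.0%}")  # 79%: close to the >80% reported
```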

Project Start:
Project End:
Budget Start: 2007-09-01
Budget End: 2010-08-31
Support Year:
Fiscal Year: 2007
Total Cost: $716,000
Indirect Cost:
Name: Travellingwave
Department:
Type:
DUNS #:
City: Seattle
State: WA
Country: United States
Zip Code: 98109