Tests of several different approaches to the automatic evaluation of the quality of speech segments are proposed. Previous systems for use in pronunciation training have typically either employed automatic speech-recognition (ASR) technology or used templates based on a limited number of utterances rated as excellent by L1 listeners (sometimes with a second set of utterances containing a common pronunciation error). Here, speech-processing technologies (HMMs and ANNs) will be developed specifically for use as evaluation systems, not recognition systems, to predict the quality and locus-of-error judgments assigned by listeners. The special feature of these systems, termed the "evaluation-of-single-words" (ESW) approach, derives from the training tokens employed in their development: multiple recordings of a single word made by groups of native and non-native talkers. Sixty talkers will be native speakers of Arabic whose intelligibility in English ranges from poor to near-perfect, and 60 talkers will be native speakers of middle-American English. Twelve words will be used, divided among one-, two-, and three-syllable words. Each talker will record ten productions of each word, yielding 14,400 tokens. Each token will be rated by listening juries for pronunciation quality, and the tokens will also be categorized into perceptual clusters using MDS and cluster-analysis techniques. At least two computer-based evaluation systems (HMM and ANN) will be trained for each individual word, with the goals of predicting overall pronunciation quality and identifying specific, commonly occurring pronunciation errors. It is expected that these word-specific systems, each representing a discrete "evaluator" custom-built for an individual word, will approach the maximum accuracy that can be expected of this class of processors.
If successful, the ESW approach may have a broad range of applications in pronunciation training, identification of a speaker's L1, foreign-language instruction, and other non-lexical applications. However, our specific goal is the development of systems that can provide informative feedback during automated pronunciation training. In ASR applications, the goal is to respond the same way to a word, no matter how it is pronounced. The goal of an ESW system is to respond differentially to pronunciation variants. This distinction between ASR and ESW is central to the development of successful evaluation systems as it dictates different modeling constraints.
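The recording design described above fixes the size of the corpus and the number of tokens available to each word-specific evaluator. As a quick check of that arithmetic, here is a minimal sketch; the variable names are illustrative and do not come from the proposal.

```python
# Corpus design as stated in the abstract; all names are illustrative only.
TALKERS_PER_GROUP = 60                      # 60 Arabic-L1 and 60 English-L1 talkers
GROUPS = ("Arabic L1", "American English L1")
WORDS = 12                                  # one-, two-, and three-syllable words
PRODUCTIONS_PER_WORD = 10                   # recordings of each word per talker

talkers = TALKERS_PER_GROUP * len(GROUPS)
total_tokens = talkers * WORDS * PRODUCTIONS_PER_WORD

# Tokens available to a single word-specific evaluator (HMM or ANN).
tokens_per_word = talkers * PRODUCTIONS_PER_WORD
tokens_per_word_per_group = TALKERS_PER_GROUP * PRODUCTIONS_PER_WORD

print(total_tokens)               # 14400
print(tokens_per_word)            # 1200
print(tokens_per_word_per_group)  # 600
```

Under these figures, each word-specific evaluator would have 1,200 rated tokens to draw on, 600 from each talker group. The abstract does not say how those tokens would be split between training and testing.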