Articulatory inversion is the problem of recovering the sequence of vocal tract shapes that produce a given acoustic utterance. Articulatory representations are useful for automatic speech recognition, speech production research, language therapy, and language learning. Articulatory inversion is a hard problem because different vocal tract shapes can produce the same acoustics, yet the articulatory trajectory must obey the mechanical constraints of the human vocal tract. Other examples of inversion problems over a sequence, which share the multivalued nature of the mappings and the existence of constraints, are: the recovery of facial gestures associated with a speech utterance; the inverse kinematics of a robot arm; and the recovery of 3D motion from video.

This project approaches articulatory inversion from a machine learning standpoint, based on a framework introduced by the PI. The low-dimensional manifold in articulatory-acoustic space is represented in a probabilistic way by a density model estimated from data (recorded using a microphone and electromagnetic articulography). Multivalued mappings are explicitly represented by the modes of conditional distributions of this density, and the articulatory trajectory is disambiguated using a continuity constraint.
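The two ingredients named above, conditional modes and a continuity constraint, can be illustrated on a toy problem. The sketch below is not the PI's implementation: it assumes a one-dimensional "articulator" x, a hypothetical forward map y = x^2 (so every acoustic value has two possible inverses), a Gaussian kernel density on the joint samples, a simple weighted mean-shift iteration to locate the modes of p(x | y), and dynamic programming to choose the smoothest path through the candidates. The function names, bandwidth `h`, and thresholds are all illustrative choices.

```python
import numpy as np

def conditional_modes(y, X, Y, h=0.05, merge=0.02, iters=500):
    """Modes of p(x | y) for a Gaussian kernel density on joint samples (X, Y)."""
    # Weight of each training pair given the observed y (assumed bandwidth h).
    w = np.exp(-0.5 * ((Y - y) / h) ** 2)
    modes = []
    for x in X[w > 0.1 * w.max()]:          # start only from well-supported samples
        for _ in range(iters):              # weighted mean-shift fixed-point iteration
            k = w * np.exp(-0.5 * ((X - x) / h) ** 2)
            x_new = (k @ X) / k.sum()
            converged = abs(x_new - x) < 1e-8
            x = x_new
            if converged:
                break
        if all(abs(x - m) > merge for m in modes):  # keep distinct modes only
            modes.append(x)
    return np.array(modes)

def smoothest_path(candidates):
    """Pick one candidate per frame, minimizing the sum of squared jumps."""
    cost = np.zeros(len(candidates[0]))
    back = []
    for prev, cur in zip(candidates[:-1], candidates[1:]):
        # total[i, j]: cost of reaching cur[i] through prev[j]
        total = (cur[:, None] - prev[None, :]) ** 2 + cost[None, :]
        back.append(total.argmin(axis=1))
        cost = total.min(axis=1)
    j = int(cost.argmin())
    path = [candidates[-1][j]]
    for t in range(len(back) - 1, -1, -1):  # backtrack from the last frame
        j = int(back[t][j])
        path.append(candidates[t][j])
    return np.array(path[::-1])

# Toy data: hypothetical forward map y = x^2, so the inverse is two-valued.
X = np.linspace(-1.0, 1.0, 401)
Y = X ** 2

# "Acoustic" sequence generated by a smooth articulatory trajectory.
x_true = np.linspace(0.5, 0.8, 7)
candidates = [conditional_modes(y, X, Y) for y in x_true ** 2]
path = smoothest_path(candidates)
```

In this toy setting each frame yields two candidate inverses of opposite sign, and the continuity term keeps the recovered path on a single branch. Because the map is symmetric, the choice of branch is arbitrary here; real articulatory data would not generally have this exact symmetry.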

The project introduces new problems in dimensionality reduction, density estimation and regularization (such as multivalued regression and graph learning from noisy data), and new models and algorithms. The expected outcomes of this work are: basic research in machine learning, and the introduction of mapping inversion problems to research and education; improved articulatory inversion (for which code will be made freely available); and the advocacy of data-driven approaches in speech production research and education.

Project Report

The practical motivation for this project was the solution of difficult inverse problems, such as articulatory inversion in speech processing, where we want to recover the vocal tract shape that produced a given utterance; people tracking in computer vision, where we want to recover the 3D pose of a person from a video; or inverse kinematics in robotics, where we want to determine the joint angles that will position a robot arm along a desired trajectory in workspace. The PI developed machine learning algorithms and theory for problems inspired by these applications.

One specific area of research concerned algorithms to reduce the dimensionality of data. The PI developed several new algorithms, as well as numerical optimization methods to accelerate their training, and extended some of them to the case where part of the training or testing data is missing. One of these algorithms, the Laplacian Eigenmaps Latent Variable Model, was used in a people tracking application. Other applications of these algorithms involved the 2D or 3D visualization of high-dimensional data.

Another specific area of research concerned mean-shift algorithms, which have traditionally been used for clustering problems, such as segmenting an image into meaningful objects. The PI has contributed to the theory of mean-shift algorithms, by proving their convergence and their order of convergence and relating them to expectation-maximization (EM) algorithms; and to their practical application, by developing fast numerical optimization methods for them. The PI has also extended their applicability to problems beyond clustering: to denoising point clouds that have a low-dimensional structure (such as the surface of a 3D object as measured with a 3D laser scanner, or the manifold defined by a collection of images of handwritten digits that vary in slant, thickness, style, etc.); and to reconstructing missing entries of a matrix, as in recommender systems.

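As a reminder of how mean-shift clustering works in its standard textbook form, here is a minimal sketch with a Gaussian kernel. It is not one of the PI's accelerated or extended variants; the bandwidth `h`, the merge radius, the toy data, and the function names are all assumptions made for illustration.

```python
import numpy as np

def mean_shift(X, h, iters=300, tol=1e-7):
    """Move a copy of every point uphill on the Gaussian kernel density of X."""
    Z = X.copy()
    for _ in range(iters):
        # Squared distances between the moving points Z and the fixed data X.
        d2 = ((Z[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
        W = np.exp(-0.5 * d2 / h ** 2)
        # Each point jumps to the kernel-weighted mean of the data.
        Z_new = (W @ X) / W.sum(axis=1, keepdims=True)
        done = np.abs(Z_new - Z).max() < tol
        Z = Z_new
        if done:
            break
    return Z

def label_modes(Z, merge):
    """Group converged points that lie within `merge` of the same mode."""
    modes, labels = [], []
    for z in Z:
        for k, m in enumerate(modes):
            if np.linalg.norm(z - m) < merge:
                labels.append(k)
                break
        else:
            modes.append(z)
            labels.append(len(modes) - 1)
    return np.array(modes), np.array(labels)

# Toy data (assumed): two well-separated 2D blobs of 50 points each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2)),
])

Z = mean_shift(X, h=1.0)
modes, labels = label_modes(Z, merge=0.5)
```

Each cluster is the set of points that converge to the same density mode, so the number of clusters is determined by the bandwidth rather than fixed in advance. Running only a few iterations instead of iterating to convergence moves points toward high-density regions without collapsing them, which gives some intuition for denoising uses, although the algorithms developed in the project may differ substantially from this sketch.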
The PI has also contributed to the problem of articulatory inversion of speech. Through the application of machine learning techniques to databases of acoustic and articulatory speech data, he has quantified the frequency with which speakers produce a given, fixed speech sound using more than one vocal tract shape. He has developed an articulatory inversion algorithm that explicitly estimates these different vocal tract shapes, and he has also applied this algorithm to the inverse kinematics problem in robotics. This research has contributed to the theoretical and computational understanding of various existing and new unsupervised learning algorithms, with particular emphasis on their optimization, and has illustrated their application in the areas mentioned above. Datasets, as well as Matlab implementations of most of the algorithms resulting from this research, are freely available from the PI's web page.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 0754089
Program Officer: Tatiana D. Korelsky
Project Start:
Project End:
Budget Start: 2007-08-01
Budget End: 2011-12-31
Support Year:
Fiscal Year: 2007
Total Cost: $385,267
Indirect Cost:
Name: University of California - Merced
Department:
Type:
DUNS #:
City: Merced
State: CA
Country: United States
Zip Code: 95343