This project will employ a cyclical two-step process to develop a computational model that embeds dynamic expression and socially engaging non-verbal gestures into talking avatars, and to experimentally test its usability within digital virtual environments involving human-digital agent interaction. Specifically, the research objectives of this project include: (1) synthesis of expressive talking faces and modeling of dynamic facial expressions, (2) synthesis of socially engaging non-verbal facial gestures, and (3) in-depth usability studies on the resultant avatars.

Digital immersive virtual environment technology has enormous implications for human-computer interaction. Many qualities of digital human representations, particularly those of human-appearing agents, are important for social engagement and social influence. In particular, non-verbal behaviors play a critical role. Among such behaviors, arguably the most important are facial expressions of emotion, which are essential for meaningful renderings of digital agents. To date, computational models that would permit such renderings are less than optimal. Indeed, an applicable and systematic computational model for rendering spontaneous, on-the-fly non-verbal facial gestures and integrating them with speech has yet to be created.

The success of this proposed project will remove a major barrier to the widespread application of useful digital human representation technology in all areas where computer-mediated communication can play a role, including commerce, education, health, engineering, and entertainment. In addition, it will have far-reaching scientific implications, providing a computationally tractable mechanism for embedding human qualities into computer-controlled entities used in other scientific and engineering fields.

Project Report

Intellectual merit outcomes:

(1) Live speech driven facial animation techniques: Live speech driven generation of lip-sync, head movement, and eye movement has been a challenging problem in the graphics and animation community for decades, for two reasons: (i) "live speech driven" implies real-time performance, so the algorithm's runtime must be highly efficient; and (ii) because only the past and current information contained in the live speech is available as runtime input, many widely used dynamic programming schemes for generating lip-sync and facial gestures cannot be applied directly. In this project, the PI developed a novel, fully automated framework to generate realistic and synchronized eye and head movement on the fly from live or prerecorded speech input (IEEE Transactions on Visualization and Computer Graphics, 2012, 18(11), pp. 1902-1914). In particular, the synthesized eye motion includes not only traditional eye gaze but also subtle eyelid movement and blinks. Comparative user studies showed that the new approach outperformed various state-of-the-art facial gesture generation algorithms and that its results are measurably close to ground truth. For lip-sync, the PI developed a practical phoneme-based approach driven by live speech (IEEE Computer Graphics and Applications, 2014, accepted). Besides generating realistic speech animation in real time, this phoneme-based approach straightforwardly handles speech input from different speakers. Compared to existing lip-sync approaches, its main advantages are its efficiency, simplicity, practicality, and ability to handle live speech input in real time. In addition, a long-standing problem in marker-based facial motion capture is determining the optimal marker layout. The PI developed an approach that computes optimized marker layouts for facial motion acquisition by optimizing characteristic control points over a set of high-resolution, ground-truth facial mesh sequences (IEEE Transactions on Visualization and Computer Graphics, 2013, 19(11), pp. 1859-1871).
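As a rough illustration of the phoneme-based, live speech driven lip-sync idea described above, the following minimal sketch (an assumption-laden illustration, not the published method) maps an incoming phoneme stream to viseme blendshape weights and uses exponential smoothing as a stand-in for co-articulation; the phoneme-to-viseme table, viseme set, and smoothing constant are all illustrative assumptions.

    import numpy as np

    # Hypothetical phoneme-to-viseme table (illustrative; real systems cover
    # the full phoneme inventory).
    PHONEME_TO_VISEME = {
        "AA": "open", "AE": "open", "IY": "wide", "UW": "round",
        "M": "closed", "B": "closed", "P": "closed",
        "F": "dental", "V": "dental", "sil": "rest",
    }
    VISEMES = ["rest", "open", "wide", "round", "closed", "dental"]

    def lipsync_step(prev_weights, phoneme, alpha=0.35):
        # One causal, real-time update: pull the viseme blend weights toward
        # the target viseme of the current phoneme. Exponential smoothing
        # stands in for co-articulation; alpha is an illustrative constant.
        target = np.zeros(len(VISEMES))
        target[VISEMES.index(PHONEME_TO_VISEME.get(phoneme, "rest"))] = 1.0
        return (1.0 - alpha) * prev_weights + alpha * target

    # Usage: drive a per-frame loop from a live phoneme recognizer's output.
    weights = np.zeros(len(VISEMES))
    weights[VISEMES.index("rest")] = 1.0
    for ph in ["sil", "M", "AA", "AA", "UW", "sil"]:  # stand-in for live input
        weights = lipsync_step(weights, ph)
        # 'weights' would be applied to the face rig as blendshape coefficients.

Because the update uses only the current phoneme and the previous frame's weights, it remains causal and constant-time per frame, which is the key constraint on live speech driven synthesis noted above.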
(2) Perceptual models for facial animation: Although various efforts have attempted to produce realistic facial animations with humanlike emotions, how to efficiently measure and synthesize believable, expressive facial animation remains a challenging research topic. The PI investigated how to build computational models that quantify the perceptual aspects of computer-generated talking avatars, addressing two unresolved questions in data-driven speech animation: (i) can the quality of dynamically synthesized speech animations be predicted automatically, without conducting actual user studies? and (ii) can one dynamically compare several algorithms and determine which synthesizes the best speech animation for a specific text or speech input? To this end, the PI developed a novel statistical model that automatically predicts, on the fly, the quality of speech animations synthesized by various data-driven algorithms (IEEE Transactions on Visualization and Computer Graphics, 2012, 18(11), pp. 1915-1927). Second, previous studies focused primarily on a qualitative understanding of human perception of avatar head movements; the quantitative association between human perception and the audio-head motion characteristics of talking avatars had yet to be uncovered. The PI quantified the correlation between perceptual user ratings (obtained via user studies) and joint audio-head motion features, as well as head motion patterns in the frequency domain (ACM CHI Conference 2011, pp. 2699-2702).

(3) Example-based skinning decomposition and compression: Linear Blend Skinning (LBS) is the most widely used skinning model in entertainment industry practice to date (a minimal sketch of the LBS forward model appears at the end of this report). The PI developed Smooth Skinning Decomposition with Rigid Bones (SSDR), an automated algorithm that extracts an LBS model from a set of example poses (ACM Transactions on Graphics, 2012, 31(6), article 199). The SSDR model can effectively approximate the skin deformation of nearly articulated models as well as highly deformable models using a small number of rigid bones and a sparse, convex bone-vertex weight map. On top of the SSDR work, the PI further developed an example-based rigging approach that automatically generates LBS models with skeletal structure (ACM Transactions on Graphics, 2014, 33(4), article 84); its output can be used directly to set up skeleton-based animation in various 3D modeling and animation packages as well as game engines. To speed up the LBS model, the PI also developed an efficient two-layer sparse compression technique that substantially reduces the computational cost of a dense-weight LBS model with negligible loss of visual quality (ACM Transactions on Graphics, 2013, 32(4), article 124).

Broader impact outcomes: This project funded (in part) three PhD students in the PI's group, who joined leading industry companies, including Disney, Industrial Light & Magic, and AMD, after their PhD graduation. Several MS students and minority undergraduate students participated in the project's research through summer or REU research experiences. The PI also mentored a student from a local high school in the greater Houston region. A YouTube channel (www.youtube.com/uhcgim) was created to disseminate the research outcomes to the public.
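For reference, the skinning results above build on the standard Linear Blend Skinning forward model, in which each deformed vertex is a convex-weighted sum of its rest position transformed by the bone transforms: v_i' = sum_j w_ij (R_j v_i + t_j). The minimal sketch below applies this forward model with illustrative names and array shapes; the decomposition, rigging, and compression algorithms themselves are described in the cited TOG papers.

    import numpy as np

    def lbs(rest_verts, weights, rotations, translations):
        # Standard Linear Blend Skinning: v_i' = sum_j w_ij * (R_j v_i + t_j).
        #   rest_verts:   (V, 3) rest-pose vertex positions
        #   weights:      (V, B) sparse, convex bone-vertex weights (rows sum to 1)
        #   rotations:    (B, 3, 3) per-bone rotations for the current pose
        #   translations: (B, 3)    per-bone translations for the current pose
        per_bone = (np.einsum('bij,vj->bvi', rotations, rest_verts)
                    + translations[:, None, :])           # (B, V, 3)
        return np.einsum('vb,bvi->vi', weights, per_bone)  # blended (V, 3)

    # Toy usage: with identity rotations and zero translations, the skinned
    # result reproduces the rest pose exactly.
    V, B = 4, 2
    rest = np.random.rand(V, 3)
    w = np.full((V, B), 0.5)              # convex weights, rows sum to 1
    R = np.stack([np.eye(3)] * B)         # identity rotations
    t = np.zeros((B, 3))                  # zero translations
    assert np.allclose(lbs(rest, w, R, t), rest)

    # SSDR-style decomposition solves the inverse problem: given example poses,
    # estimate weights, rotations, and translations that best reproduce them
    # under this forward model, subject to sparsity and convexity constraints.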

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 0914965
Program Officer: William Bainbridge
Project Start:
Project End:
Budget Start: 2009-08-01
Budget End: 2014-07-31
Support Year:
Fiscal Year: 2009
Total Cost: $259,147
Indirect Cost:
Name: University of Houston
Department:
Type:
DUNS #:
City: Houston
State: TX
Country: United States
Zip Code: 77204