The manifestation of language in space poses special challenges for computer-based recognition. Prior approaches to sign recognition have not leveraged knowledge of linguistic structures and constraints, in part because of limitations in the computational models employed, and they have focused on the recognition of limited classes of signs. No existing system can recognize signs of all morphophonological types, or even discriminate among these types in continuous signing. By integrating several computational approaches, informed by knowledge of the linguistic properties of manual signs and supported by a large existing linguistically annotated corpus, the team will develop a robust, comprehensive framework for sign recognition from video streams of natural, continuous signing. Fundamental differences in linguistic structure distinguish signed languages, which unfold in 4D with spatio-temporal dependencies and multiple production channels, from spoken languages, and these differences are critical to computer-based recognition: fingerspelled items, lexical signs, and classifier constructions, for example, require different recognition strategies. Linguistic properties will be leveraged here for (i) segmentation and categorization of these significantly different types of signs and then, although this subsequent enterprise will necessarily be limited in scope within the project period, (ii) recognition of the segmented sign sequences. The 3D hand pose estimation provided by a team-developed tracker will yield significant tracking accuracy, robustness, and computational efficiency. This 3D information is expected to greatly improve recognition results compared with schemes that use only 2D information. The 3D estimates from the tracker will be used in the proposed hierarchical Conditional Random Field (CRF) based recognition, allowing tracking and recognition of signs that are distinct in their linguistic composition. Since other signed languages rely on a very similar sign typology, this technology will be readily extensible to computer-based recognition of other signed languages.
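The sign-type segmentation step can be pictured as frame-level sequence labeling. The following is a minimal, self-contained sketch, not the project's actual hierarchical CRF, of linear-chain Viterbi decoding over hypothetical sign-type labels, assuming per-frame 3D hand features are already available from a tracker; the label set, feature dimension, and weights are illustrative placeholders.

```python
# Illustrative sketch only: a minimal linear-chain decoder for labelling video
# frames with a sign-type category, assuming per-frame 3D hand features are
# available from the tracker. The label set, feature dimension, and weights
# are hypothetical; the project's actual hierarchical CRF is richer.
import numpy as np

LABELS = ["LEXICAL", "FINGERSPELLED", "CLASSIFIER", "TRANSITION"]  # assumed

def viterbi_decode(unary, transition):
    """unary: (T, K) per-frame label scores; transition: (K, K) label-pair scores.
    Returns the highest-scoring label sequence (standard Viterbi recursion)."""
    T, K = unary.shape
    score = np.full((T, K), -np.inf)
    backp = np.zeros((T, K), dtype=int)
    score[0] = unary[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + transition   # score of each (prev, cur) pair
        backp[t] = cand.argmax(axis=0)              # best previous label per current label
        score[t] = cand.max(axis=0) + unary[t]
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backp[t, path[-1]]))
    return [LABELS[k] for k in reversed(path)]

# Toy usage: frame features (e.g., 3D hand pose plus velocities) scored by a
# linear layer; in a real CRF the weights would be learned from labelled data,
# not set by hand as here.
rng = np.random.default_rng(0)
T, D, K = 40, 12, len(LABELS)
frame_feats = rng.normal(size=(T, D))          # stand-in for tracker output
W = rng.normal(scale=0.1, size=(D, K))         # unary weights (hypothetical)
A = np.full((K, K), -1.0) + 3.0 * np.eye(K)    # favour staying in one sign type
print(viterbi_decode(frame_feats @ W, A)[:10])
```

In the full hierarchical model described above, such frame-level type labels would feed a higher layer that applies the recognition strategy appropriate to each type of sign.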
This linguistically based hierarchical framework for ASL sign recognition, built on techniques with direct applicability to other signed languages as well, provides, for the first time, a way to model and analyze both the discrete and the continuous aspects of signing, and it enables appropriate recognition strategies to be applied to signs with linguistically different composition. This approach will also allow future integration of the discrete and continuous aspects of facial gestures with manual signing, to further improve computer-based modeling and analysis of ASL. The lack of such a framework has held back sign language recognition and generation. Advances in this area will, in turn, have far-ranging benefits for Universal Access and improved communication with the Deaf. Further applications of this technology include automated recognition and analysis of non-verbal communication in general, security applications, human-computer interfaces, and virtual and augmented reality. These techniques have potential utility for any human-centered application with continuous and discrete aspects. The proposed approach will offer ways to address similar problems in other domains characterized by multidimensional, complex spatio-temporal data that require the incorporation of domain knowledge. The products of this research, including software, videos, and annotations, will be made publicly available for use in research and education.
The goal of this project was to advance computer-based sign recognition from video through use of state-of-the-art techniques for hand tracking and 3D hand pose estimation, combined with exploitation of knowledge about the linguistic constraints that govern the internal structure of signs in American Sign Language (ASL). Sign recognition is challenging because of linguistic complexities inherent to signed language and the difficulty of interpreting, from a 2D video projection, linguistic signals that occur on multiple parallel channels (expressed through 3D configurations and movements of the hands, arms, face, and upper body) over varying timescales. This is an important research area, since automated sign language recognition and generation hold great promise for improving communication between deaf and hearing individuals, enabling full access, and improving the lives of the deaf. In addition, advances in the ability to interpret human motion will have applications across a wide range of areas that require structured and unstructured human movement analytics (e.g., the behavioral sciences, security, HCI, graphics, computer vision).

Accomplishments

We have established a framework for recognition of signs, both in isolation (citation form) and in continuous signing, based on their linguistic properties. Substantial work was carried out through this grant to complete the development of a large, linguistically annotated lexical video corpus to support computer science and linguistic research on ASL signs [Carol Neidle, Ashwin Thangali, and Stan Sclaroff, Development of the American Sign Language Lexicon Video Dataset (ASLLVD) Corpus. LREC 2012]. The collection includes nearly 10,000 examples, corresponding to ~3,000 distinct base signs. This dataset is shared publicly; our Web-based Data Access Interface is being extended to allow access to it: http://secrets.rutgers.edu/dai/queryPages/search/search.php

We have developed:

1. Novel methods for tracking the hands, first in 2D and then extended to 3D, which can deal with occlusions, variations in lighting, and complex hand configurations. The techniques for accurate 3D hand pose estimation are based on discriminative methods for estimating hand configuration from 2D video sequences. Hand configuration, particularly at the start and end of a lexical sign, is crucial for sign recognition.

2. Computational learning techniques for identifying the sign type in continuous signing. In signed (unlike spoken) languages, the internal structure of a word differs depending on morphological class; that is, fingerspelled signs, lexical signs, and classifier constructions have significantly different internal structure. We have therefore focused our attention in this project on computer-based recognition of lexical signs, the predominant class.

3. A method for leveraging linguistic constraints to improve upon the results achieved in (2) for recognition of start and end hand configurations. We exploit these constraints by taking into consideration the statistical relationships between start and end handshapes, as reflected in our ASLLVD corpus (see the sketch following this list).

4. A framework that combines these techniques for continuous sign recognition, integrating our low-level image processing, hand and upper-body tracking, 3D hand modeling, and linguistic constraints into a unified stochastic time-series learning framework.
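As a concrete illustration of item 3, the sketch below shows one simple way start/end handshape co-occurrence statistics estimated from an annotated corpus could be used to rescore a handshape classifier's output. The handshape inventory, counts, and probabilities are toy values, and the project's actual formulation may differ.

```python
# Illustrative sketch, not the project's exact formulation: rescoring start/end
# handshape hypotheses for a sign with a co-occurrence prior estimated from
# annotated corpus counts (a made-up 5-handshape inventory here; the ASLLVD
# handshape inventory is much larger).
import numpy as np

def cooccurrence_prior(start_end_pairs, n_handshapes, alpha=1.0):
    """Estimate P(end | start) from annotated (start, end) index pairs,
    with add-alpha smoothing so unseen pairs keep nonzero probability."""
    counts = np.full((n_handshapes, n_handshapes), alpha)
    for s, e in start_end_pairs:
        counts[s, e] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

def rescore(p_start, p_end, prior):
    """Combine the classifier's start/end posteriors with the start->end prior
    and return the jointly most probable (start, end) handshape pair."""
    joint = p_start[:, None] * prior * p_end[None, :]
    s, e = np.unravel_index(joint.argmax(), joint.shape)
    return int(s), int(e), joint[s, e]

# Toy usage with 5 hypothetical handshape classes.
annotated = [(0, 1), (0, 1), (2, 2), (3, 4), (0, 3)]     # corpus annotations
prior = cooccurrence_prior(annotated, n_handshapes=5)
p_start = np.array([0.40, 0.30, 0.10, 0.10, 0.10])       # classifier output
p_end   = np.array([0.25, 0.35, 0.15, 0.15, 0.10])
print(rescore(p_start, p_end, prior))
```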
We have also done preliminary research on techniques for 3D tracking of upper-body movement (and on 3D methods for face tracking, as part of a related project), based, for the first time, on the coupling of discriminative and generative methods applied to 2D video sequences. Furthermore, using new techniques for tracking from 2D video and for estimating the 3D configuration of the torso, arms, and hands, we can now extract 3D parameters related to signs, which will be used to further improve continuous recognition of sign type and of the signs themselves. Our initial testing involves a learning model based on motion trajectories of the hands and upper body, combined with 3D discriminative handshape classification (using start/end handshape linguistic constraints). The start/end handshape probability vector, in combination with the motion-trajectory vector, forms our feature input. We trained a structured SVM-HMM (Support Vector Machine - Hidden Markov Model) for each handshape. We train and test on the 150 most frequently used ASL signs in our dataset (66% training / 33% testing) and achieve ~88.3% accuracy in identifying the correct ASL sign. Ultimately, we will fully integrate supervised ASL knowledge and structure into our recognition model. One advantage of our dynamic stochastic learning approach (combining bottom-up data-driven and top-down domain-knowledge methods) over research that is solely data-driven is scalability: our approach will allow accurate sign identification over a vocabulary of thousands of potential signs.
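For illustration only, the sketch below assembles the feature vector described above (start/end handshape probability vectors concatenated with a motion-trajectory descriptor) and trains a plain multi-class linear SVM as a stand-in for the structured SVM-HMM actually used, which models temporal structure explicitly. All dimensions and data are synthetic placeholders.

```python
# Illustrative sketch of the feature construction described above, with a
# generic multi-class linear SVM standing in for the structured SVM-HMM.
# Handshape inventory size, trajectory descriptor, and data are assumptions.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

N_HANDSHAPES = 20      # hypothetical handshape inventory size
N_SIGNS = 150          # 150-sign vocabulary, as in the experiment above
TRAJ_DIM = 30          # e.g., a resampled 3D wrist trajectory, flattened

def sign_feature(p_start, p_end, trajectory):
    """Concatenate start/end handshape probability vectors with a fixed-length
    motion-trajectory descriptor to form one feature vector per sign token."""
    return np.concatenate([p_start, p_end, trajectory.ravel()])

# Synthetic stand-in data: in practice p_start/p_end would come from the 3D
# handshape classifier (with co-occurrence rescoring) and the trajectories
# from the upper-body tracker.
rng = np.random.default_rng(1)
n_tokens = 3000
X = np.stack([
    sign_feature(rng.dirichlet(np.ones(N_HANDSHAPES)),
                 rng.dirichlet(np.ones(N_HANDSHAPES)),
                 rng.normal(size=TRAJ_DIM))
    for _ in range(n_tokens)
])
y = rng.integers(0, N_SIGNS, size=n_tokens)

# Roughly two-thirds / one-third split, mirroring the evaluation protocol.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)
clf = LinearSVC(max_iter=5000).fit(X_tr, y_tr)
print("toy accuracy:", clf.score(X_te, y_te))   # meaningless on random data
```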