The objective of this research is both to advance the field of video data processing (more particularly gesture recognition) and to illustrate the power of deep learning architectures and transfer learning. The approach is to organize a challenge culminating in a live evaluation at the site of a conference.
Intellectual merit: Much of the recent research in Adaptive and Intelligent Systems (AIS) has sacrificed the grand goal of designing systems that approach human intelligence in favor of solving data mining tasks of practical interest with more immediate reward. This project gives deep learning architectures inspired by neural networks an opportunity to demonstrate their ability to address more complex problems that require transferring knowledge from task to task (transfer learning), leveraging the availability of video data not directly related to the target task of gesture recognition. The participants will also be involved in a data exchange to grow an unprecedentedly large and diverse database of gestures.
Broader impact: Challenges have proved to be a great stimulus of research. For a long-lasting impact, the challenge platform and the data and software repositories will remain open beyond the term of the NSF-funded project. The educational components of the project include engaging students in the contest, providing material directly usable in teaching curricula, and demonstrating gesture recognition to high school students to expose them to computer vision research and sign language communication. Our connections with the deaf community will allow us to gear the product of this research toward advancing assistive technology.
Gesture recognition is an important sub-problem in many computer vision applications, including image/video indexing, robot navigation, video surveillance, computer interfaces, and gaming. With simple gestures such as hand waving, gesture recognition could enable controlling the lights or thermostat in your home or changing TV channels. The same technology may even make it possible to automatically detect more complex human behaviors, for example to allow surveillance systems to sound an alarm when someone is acting suspiciously, or to send help whenever a bedridden patient shows signs of distress. Gesture recognition also provides excellent benchmarks for Adaptive and Intelligent Systems (AIS) and computer vision algorithms. The recognition of continuous, natural gestures is very challenging due to the multimodal nature of the visual cues (e.g., movements of fingers and lips, facial expressions, body pose), as well as technical limitations such as spatial and temporal resolution and unreliable depth cues. Technical difficulties include reliably tracking the hands, head, and other body parts, and achieving 3D invariance. The competition we organized in the context of this NSF-sponsored project helped improve the accuracy of gesture recognition using Microsoft Kinect(TM) motion sensor technology, a low-cost 3D depth-sensing camera.
Intellectual merit: Much of the recent research in Adaptive and Intelligent Systems (AIS) has sacrificed the grand goal of designing systems that approach human intelligence in favor of solving data mining tasks of practical interest with more immediate reward. Humans can recognize new gestures after seeing just one example (one-shot learning). For computers, though, recognizing even well-defined gestures, such as those of sign language, is much more challenging and has traditionally required thousands of training examples to teach the software.
One of our goals was to evaluate whether transfer learning algorithms, which can exploit miscellaneous data resources, can improve the performance of systems designed to work on new, similar tasks (e.g., recognizing a new vocabulary of gestures). To see what machines are capable of, we launched in 2012 a competition sponsored by ChaLearn, with prizes donated by Microsoft. The challenge helped narrow the gap between machine and human performance. In two rounds, each lasting four months, the challenge attracted a total of 85 teams, who made 935 entries. They lowered the error rate from more than 50% for the baseline method to 7%. The winner of the challenge, Alfonso Nieto Castanon, used a method he invented, which is inspired by the human vision system. In each round, he and the second- and third-place winners were awarded $5000, $3000, and $2000 respectively and got an opportunity to present their results at the CVPR 2012 and ICPR 2012 conferences. We also organized demonstration competitions of gesture recognition systems using Kinect(TM) in conjunction with those events, with similar prizes donated by Microsoft. Novel data representations were proposed that successfully tackled the problem of hand and finger posture recognition in real time.
Broader impact: Challenges have proved to be a great stimulus of research in machine learning, pattern recognition, and robotics. Our main activities were the collection of a large dataset, which was made publicly available, and the organization of two rounds of a quantitative challenge and a qualitative demonstration competition for two major IEEE conferences. Our activities also included the dissemination of methods by editing a special topic of the Journal of Machine Learning Research (JMLR), which will also be published as a book. We are in the process of completing a library of tools that addresses the problems of the challenge with the algorithms used by the winners.
For a long-lasting impact, the challenge platform and the data and software repositories will remain available beyond the term of the NSF-funded project. Awards in the form of travel grants gave students the opportunity to present their results in front of an audience of experts. Microsoft will be evaluating successful participants in both challenge rounds for two potential IP agreements of $100,000 each. With Microsoft interested in buying the intellectual property, the hope is that the new algorithms that emerged from the contest have not only boosted accuracy but will also open the door to a whole new range of applications. The demonstration competition winners presented systems capable of accurately tracking hand postures in real time, with applications to touch-free exploration of 3D medical images by surgeons in the operating room, finger spelling (sign language for the deaf), virtual shopping, and game control.
Credits: The challenge evaluation website was hosted by Kaggle. The challenge was initiated by the US Defense Advanced Research Projects Agency (DARPA) Deep Learning Program and is supported by the US National Science Foundation under grant ECCS 1128436, the European Pascal2 network of excellence, Microsoft, and Texas Instruments. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors and funding agencies.