CI-P: Collaborative Research: The Speech Recognition Virtual Kitchen
This project provides a "kitchen" environment to promote community sharing of research techniques, foster innovative experimentation, and provide solid reference systems for automatic speech recognition, as a tool for education, research, and evaluation. The research infrastructure is built around virtual machines (VMs), which can be reconfigured and shared easily. We liken the virtual machines to a "kitchen" because they provide the environment into which one can install "appliances" (e.g., speech recognition toolkits), "recipes" (scripts for creating state-of-the-art systems with a toolkit), and "ingredients" (spoken language data), along with "dishes" (completed experiments with log files for reference).
The planning project engages the community to tackle some of the problems that have previously hampered cross-community sharing, including distribution methods and intellectual property issues. The project also provides an example architecture, which serves as a focal point for community-wide discussion.
In terms of broader impacts, the project engages researchers and educators who typically do not participate in automatic speech recognition (ASR) research by providing travel scholarships to a workshop at INTERSPEECH 2012. In a wider scope, the infrastructure may be usable by other data-intensive fields (synthesis, dialog systems, NLP, computer vision, data mining). By providing a permanent, publicly available resource for research, education, and evaluation in ASR, we can better train the next generation of undergraduate and graduate students. The "kitchen" gives them easy access to a large number of state-of-the-art implementations, and facilitates deeper analysis of algorithms and better comparisons across systems.
Building and maintaining a state-of-the-art Automatic Speech Recognition (ASR) system has moved beyond the ability of a single developer. It is difficult for all but the largest university laboratories to maintain an end-to-end system and adapt it to new languages, tasks, or conditions as required. Other researchers, who are not experts in speech recognition, find it impossible to use ASR for non-English languages, non-mainstream dialects, distant microphones, or children's speech, simply because no such recognizer is available off the shelf. What has been missing is a way for academic institutions (and industry) to leverage community resources in order to branch off new research from fully functional end-to-end system configurations, rather than from a collection of individually downloaded tools, scripts, and data. Even when a well-documented, state-of-the-art open-source ASR toolkit is used, a "black box" approach usually results in poor performance.
This project attempts to extend the model of lab-internal knowledge transfer to a community-wide effort through the use of Virtual Machines (VMs). We design an infrastructure to share entire ready-to-run baseline "recipes" together with data, log files, results, etc., in a working environment, and with links to other users around the world who work on exactly the same task. Students and researchers can then modify recipes step by step, observing the effect of each change. Testing a system on different data becomes almost trivial, and retraining a system becomes very easy, because a working training setup is available for comparison.
The "Speech Recognition Virtual Kitchen" first serves as a repository for Virtual Machines, which will typically be based on redistributable operating systems such as Ubuntu Linux, to provide a common infrastructure for the use of toolkits and data. A user would download a VM from the "Kitchen Server" onto a "Host PC" and run it.
Using the kitchen, any changes the user makes to the VM can be compiled into a software package and shared with other users who are running the same VM. Results can be uploaded to the kitchen and displayed in a "high score" table, showing how well individual users are doing and offering students an incentive to continue. Class projects can be shared easily using VMs, which also allow for "versioning"; this could be used to distribute example solutions to students.
The present CRI-P planning grant developed the idea and organized a workshop to collect community input. Example VMs were created, along with concepts for setting up the kitchen server for minimum data transfer. A follow-up CRI grant is currently implementing the "Speech Recognition Virtual Kitchen", which we expect to go public in 2014.
The "Speech Recognition Virtual Kitchen" represents an easy mechanism to create, maintain, and distribute high-quality experiments in the speech and language area, which can be used at all levels of education, to distribute baselines for evaluation, and to promote the use of speech recognition in related fields such as information retrieval, user interfaces, and robotics. The outcome will be a better-educated future workforce in a field that is critical for human-machine communication, information access and analysis, ICT for development, and many other uses.