Many deaf and hard of hearing students use real-time captioning to participate in education. Real-time captions are generally provided by skilled professional captionists (stenographers), who use specialized keyboards or software to keep up with natural speaking rates of up to 225 words per minute. But professional captionists are expensive and must be arranged in advance in blocks of at least an hour. Automatic speech recognition (ASR) is improving, but still suffers high error rates in real classrooms.

In this collaborative effort between the University of Rochester and the Rochester Institute of Technology, the PIs will address these issues by blending human- and machine-powered captioning to produce captions on demand, in real time, and at low cost. The PIs' approach is for multiple non-experts and ASR to collectively caption speech in under 5 seconds, aided by interfaces that encourage quick, incomplete captioning of live audio. Because non-experts cannot keep up with natural speaking rates, new algorithms will merge their incomplete captions in real time, as sketched below. (While the sequence alignment problem can be solved exactly with dynamic programming, existing approaches are too slow, are not robust to input error, and do not incorporate natural language semantics.) Systematically varying the saliency of the audio presented to each captionist will encourage complete coverage of the speech. Non-expert captions will also train ASR engines in real time, so that recognition can improve over the course of a lecture. (Traditional approaches to ASR training assume that training occurs offline.)

The quikCaption mobile application will embody these ideas and will be iteratively designed with deaf and hard of hearing students at the National Technical Institute for the Deaf (NTID) through design sessions, lab studies, and in-class deployments. Non-expert captionists can be drawn from a broad range of sources: volunteers willing to donate their time, classmates with relevant domain knowledge, or always-available paid workers. They may be local (in the classroom) or remote, and they may have experience from prior quikCaption sessions or may be novice crowd workers recruited on demand from existing marketplaces (e.g., Mechanical Turk). A flexible worker pool will allow real-time captions to be available on demand, at low cost, and for only as long as needed.
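To make the merging step concrete, the sketch below shows a highly simplified version of the dynamic-programming sequence alignment alluded to above: it merges just two partial captions of one sentence by aligning the words they share (longest-common-subsequence style) and interleaving the words each captionist caught alone. This is an illustration, not the project's algorithm; the proposed system must additionally handle many captionists, streaming input, timing, typos, and semantics, which is precisely why new algorithms are needed. The function names (`align`, `merge_partial_captions`) are hypothetical.

```python
# Illustrative sketch only: merge two incomplete captions of the same
# utterance with a classic dynamic-programming (LCS-style) alignment.

def align(a, b):
    """Return matched (i, j) index pairs between token lists a and b."""
    n, m = len(a), len(b)
    # dp[i][j] = length of the LCS of a[:i] and b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Backtrack to recover the matched word positions.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))


def merge_partial_captions(cap_a, cap_b):
    """Merge two incomplete captions into one transcript (toy heuristic)."""
    a, b = cap_a.lower().split(), cap_b.lower().split()
    merged, ia, ib = [], 0, 0
    for i, j in align(a, b):
        merged.extend(a[ia:i])   # words only captionist A typed
        merged.extend(b[ib:j])   # words only captionist B typed
        merged.append(a[i])      # word both agree on (alignment anchor)
        ia, ib = i + 1, j + 1
    merged.extend(a[ia:])        # trailing words unique to A
    merged.extend(b[ib:])        # trailing words unique to B
    return " ".join(merged)


if __name__ == "__main__":
    # Each non-expert captures only part of the sentence.
    partial_a = "the mitochondria is the of the cell"
    partial_b = "mitochondria is powerhouse of the"
    print(merge_partial_captions(partial_a, partial_b))
    # -> "the mitochondria is the powerhouse of the cell"
```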
Broader Impacts: This research will dramatically improve education for deaf and hard of hearing students by enabling access to serendipitous opportunities, such as conversations after class or last-minute guest lectures for which no interpreter or captionist was arranged. Real-time captioning will also be useful in other settings such as school programs, artistic performances, and political events. Older hard of hearing adults typically prefer captioning and represent a sizable and growing population; hearing people may also benefit, because captioning is a first step toward automatic translation of aural speech. The algorithms developed in this project for real-time merging of incomplete natural language will likely be adaptable to other applications, such as collaborative translation or communication over noisy channels.