The goal of this research is to allow a (live) conductor to create a personal musical performance by controlling a computer-driven (virtual) orchestra through gesture captured on video. The immediate focus is to provide an educational tool for conducting students, or others with some serious musical training who are capable of communicating musical intent clearly through the traditional language of gesture used by conductors. However, the PI expects that less-schooled or novice users will also be able to learn from and find enjoyment with the outcome of this research, which will culminate in a computer system that runs on generic computer hardware and will be made freely available. The system will take video of a conductor as input, reducing this input to a two-dimensional conducting trace that describes the movement of the tip of the conductor's baton over time. The system will perform real-time estimation of the conductor's precise "state" within the composition, using an approach that fuses hidden Markov model methodology with a Kalman filter model for musical timing. Using this on-line estimate, the system will predict the location of future musical events, thus addressing the inevitable issue of detection latency. Concurrently, the system will synthesize real-time audio to follow the conducted performance, using a previously recorded performance whose timing is continually warped using phase-vocoding. The initial focus of this work will be on musical timing rather than dynamics, articulation, etc., as this is the aspect of conducting that is most clearly communicated through motion and usually also the one that affords the most expressive potential and sense of "ownership" of the performance. Educated musicians find surprising agreement when evaluating the accuracy with which a musician or ensemble follows a knowledgeable conductor, suggesting that the conductor's "signal" must be relatively unambiguous. Making mathematical sense of the relationship between this signal and its meaning constitutes a challenging dimension of this research.
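The timing model outlined above can be made concrete with a small sketch. The following Python fragment tracks score position and tempo with a Kalman filter, in which position and tempo play the roles of location and velocity, and uses the filtered state to schedule a future score event ahead of the detection latency. It is a minimal illustration under assumed constants (noise levels, tempo, and the class and function names), not the system's actual implementation.

```python
import numpy as np

class BatonTimingFilter:
    """Minimal sketch: track score position and tempo from detected beats."""

    def __init__(self, tempo_bps=2.0):
        self.x = np.array([0.0, tempo_bps])   # [score position (beats), tempo (beats/sec)]
        self.P = np.diag([1e-2, 1e-1])        # state uncertainty
        self.Q = np.diag([1e-4, 1e-3])        # process noise: tempo may drift
        self.R = 5e-3                         # variance of detected beat positions (beats^2)

    def update(self, dt, detected_beat):
        """Fold in one detected beat: dt seconds since the previous update,
        detected_beat = score position (in beats) of the gesture just seen."""
        F = np.array([[1.0, dt], [0.0, 1.0]])
        x_pred = F @ self.x
        P_pred = F @ self.P @ F.T + self.Q
        H = np.array([1.0, 0.0])              # we observe score position only
        innov = detected_beat - H @ x_pred
        S = H @ P_pred @ H + self.R
        K = P_pred @ H / S
        self.x = x_pred + K * innov
        self.P = (np.eye(2) - np.outer(K, H)) @ P_pred
        return self.x

    def seconds_until(self, future_beat):
        """Predicted wait before the orchestra should reach `future_beat`,
        used to schedule audio ahead of the video-detection latency."""
        pos, tempo = self.x
        return max(future_beat - pos, 0.0) / max(tempo, 1e-6)

# Example: beats detected roughly every half second with timing jitter.
f = BatonTimingFilter()
for beat, dt in enumerate([0.52, 0.48, 0.50, 0.55], start=1):
    f.update(dt, beat)
print("time to beat 6:", round(f.seconds_until(6), 3), "s")
```

Because the prediction step looks ahead from the filtered state, audio can be synthesized for upcoming events before the corresponding gesture has been fully observed, which is the essence of the latency compensation described above.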

Broader Impacts: This work will have lasting impact on conducting pedagogy, by providing a tireless and responsive laboratory for musical experimentation. The research will also make contributions to instrumental and voice pedagogy, by allowing a musician to focus on the interpretive aspects of a piece without simultaneously addressing the technical challenges. The problem of planning the orchestra's musical evolution with uncertain and continually evolving knowledge of the conductor's actions is deeply challenging; thus, this work has implications for the general domain of planning under uncertainty. Perhaps most importantly for society at large, a successful conducting system would bring the pleasure of music-making to a broad and international collection of users who might otherwise have little or no experience creating music.

Project Report

This project explored the possibility of controlling a real-time music performance through gestures, similar to the way a conductor leads a performance of live musicians. Our approach uses live video from the conductor and produces audio by resynthesizing an existing recording while allowing the timing to vary. This work leverages our long-standing work on accompaniment systems, which follow a live musician by analyzing the live player's audio and determining where the notes have been, and will be, played.

Under this grant we developed a rudimentary system that analyzes the video from a conductor employing a known beat pattern, recognizes the conductor's beat times, and controls the output audio by predicting the future beat given what is currently known. While our initially proposed approach works well in the case of audio input, it falls short with video from a conductor. For one, a conductor typically makes only one gesture per musical beat, while the music often contains several notes per beat; thus the density of timing information is considerably lower for the conducting problem. In addition, conducting gestures often indicate beat times with considerably less clarity than one finds with soloist audio. Often this is intentional on the part of the conductor, as with legato music, in which the conductor seeks to hide individual beats, thus focusing attention on longer time units such as phrases.

Our response to this difficulty was to develop more sophisticated models for musical timing than those used with our previous audio-based accompaniment systems. Our earlier models are inspired by classical position-tracking approaches (e.g., the Kalman filter), in which musical score position and tempo are modeled analogously to location and velocity. Our newer model goes beyond the confines of Kalman filter approaches, introducing a hidden layer of variables that describe the local musical state. This switching Kalman filter model achieves better prediction by inferring the local intent for a short section of music (maintaining constant tempo, slowing down, etc.); furthermore, it capitalizes on the logic that governs the ordering of these hidden states (a schematic sketch appears below). We have explored our switching Kalman filter model in other musical applications, such as the automatic identification and correction of performance errors and the visualization of musical intent.

A related challenge pursued through this grant is the understanding of musical expression through score analysis, rather than analysis of performance data. This line of research seeks a note-by-note labeling of a simple melody, describing each note's role in a larger prosodic context (stress, direction, grouping). As far as we know, this is the first attempt to explicitly represent musical expression itself, as opposed to the performance consequences (louder, slower, vibrato, etc.) of the expression. Our representation makes it possible to employ statistical approaches that can learn from musical corpora and estimate convincing interpretations, resulting in actual synthesis of audio. This work has commercial applications in computer games that employ music, as well as score-writing programs that produce more human-sounding performances.

The core intellectual challenge posed by this project is the representation and understanding of musical expression, as it relates to human-computer performance systems that require expressive synthesis. Simple state space models that assume only smoothly varying tempo fail to relate any higher-level musical ideas to the actual performance.
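To make the switching idea concrete, the sketch below runs a small bank of regime-conditional Kalman filters over score position and tempo, with a Markov chain over hidden "local intent" states and a moment-matching collapse after each detected beat. The regime labels, transition probabilities, drift values, and noise levels are illustrative assumptions, not the project's actual parameters.

```python
import numpy as np

REGIMES = ["steady", "slowing", "accelerating"]   # hypothetical local-intent labels

class SwitchingTempoFilter:
    """Sketch of a switching Kalman filter over (score position, tempo)."""

    def __init__(self, tempo_bps=2.0):
        # Markov transition matrix over regimes; values are assumed, not fitted.
        self.T = np.array([[0.90, 0.05, 0.05],
                           [0.10, 0.85, 0.05],
                           [0.10, 0.05, 0.85]])
        self.drift = np.array([0.0, -0.05, 0.05])  # regime tempo drift (beats/sec per sec)
        self.Q = np.diag([1e-4, 1e-3])             # process noise
        self.R = 5e-3                              # beat-observation noise (beats^2)
        self.H = np.array([1.0, 0.0])              # we observe score position only
        self.x = np.array([0.0, tempo_bps])        # [position (beats), tempo (beats/sec)]
        self.P = np.diag([1e-2, 1e-1])
        self.p_regime = np.full(3, 1.0 / 3)        # posterior over regimes

    def step(self, dt, observed_beat):
        """Fold in one detected beat (dt seconds since the previous one)."""
        F = np.array([[1.0, dt], [0.0, 1.0]])
        means, covs, liks = [], [], []
        for drift in self.drift:
            # Each regime shares the linear dynamics but adds its own tempo drift.
            x_pred = F @ self.x + np.array([0.5 * drift * dt**2, drift * dt])
            P_pred = F @ self.P @ F.T + self.Q
            innov = observed_beat - self.H @ x_pred
            S = self.H @ P_pred @ self.H + self.R
            K = P_pred @ self.H / S
            means.append(x_pred + K * innov)
            covs.append((np.eye(2) - np.outer(K, self.H)) @ P_pred)
            liks.append(np.exp(-0.5 * innov**2 / S) / np.sqrt(2 * np.pi * S))
        # Combine observation likelihoods with the Markov prior over regimes.
        w = np.array(liks) * (self.T.T @ self.p_regime)
        self.p_regime = w / w.sum()
        # Collapse the regime-conditional Gaussians by moment matching.
        self.x = sum(p * m for p, m in zip(self.p_regime, means))
        self.P = sum(p * (C + np.outer(m - self.x, m - self.x))
                     for p, m, C in zip(self.p_regime, means, covs))
        intent = {r: round(float(p), 2) for r, p in zip(REGIMES, self.p_regime)}
        return self.x, intent

# Example: a conductor gradually broadening the tempo (beats arrive later and later).
f = SwitchingTempoFilter()
for beat, dt in enumerate([0.50, 0.52, 0.55, 0.59, 0.64], start=1):
    state, intent = f.step(dt, beat)
print("estimated tempo:", round(float(state[1]), 3), "beats/sec; intent:", intent)
```

The posterior over regimes makes the filter's reading of the conductor's local intent explicit, which is what allows better prediction than a single smoothly varying tempo model.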
While such simple state-space models can be trained with actual performance data to predict reasonably well, they come to each new musical piece as a blank slate, without any accumulated musical knowledge. Our current efforts address this difficulty using greater sophistication in the modeling of musical timing and prosody. We develop models that explicitly incorporate higher-level musical ideas having origins in the way musicians talk and think about musical performance. While we have explored only a few of these applications (conducting, accompaniment systems, synthesizing expressive performances), mostly uncharted territory remains. This work will find commercial applications in score-writing programs, enhanced musical synthesis by computer, musical tutoring systems, and human-computer systems for musical interaction.
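As an illustration of what an explicit note-by-note expressive representation might look like, the sketch below annotates a short melody with the stress, direction, and grouping roles mentioned above and maps them to a simple timing consequence. The field names, label vocabulary, and the stretch factor are illustrative assumptions rather than the project's actual encoding.

```python
from dataclasses import dataclass

# Hypothetical note-level prosodic labels: stress, direction, and grouping.
@dataclass
class NoteLabel:
    pitch: str             # e.g. "C4"
    duration_beats: float
    stress: bool           # does the note carry a local emphasis?
    direction: str         # "toward" / "away" / "arrival" within the phrase
    group_id: int          # index of the sub-phrase grouping the note belongs to

melody = [
    NoteLabel("E4", 1.0, False, "toward",  0),
    NoteLabel("F4", 1.0, False, "toward",  0),
    NoteLabel("G4", 2.0, True,  "arrival", 0),
    NoteLabel("F4", 1.0, False, "away",    1),
    NoteLabel("E4", 1.0, False, "away",    1),
    NoteLabel("D4", 2.0, True,  "arrival", 1),
]

# Such labels can then be mapped to performance consequences (timing, dynamics)
# by a learned model; here an assumed rule stretches stressed arrivals slightly.
for note in melody:
    stretch = 1.08 if note.stress else 1.0
    print(note.pitch, round(note.duration_beats * stretch, 2))
```

The point of such a representation is that the expressive intent itself becomes the object that is labeled and learned from corpora, with the audible consequences derived from it rather than modeled directly.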

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 0812244
Program Officer: Ephraim P. Glinert
Project Start:
Project End:
Budget Start: 2008-09-01
Budget End: 2012-08-31
Support Year:
Fiscal Year: 2008
Total Cost: $457,995
Indirect Cost:
Name: Indiana University
Department:
Type:
DUNS #:
City: Bloomington
State: IN
Country: United States
Zip Code: 47401