This research project aims to make large amounts of web-based music information accessible to researchers, scholars, and students from a range of disciplines pursuing a variety of goals. The approach proposed here applies new, exploratory computational methods to extremely large corpora of digital music, in combination with sets of proven algorithms and tools for effectively extracting features from recorded music. These developments, together with successful efforts to create and deploy massive online music corpora, have created circumstances in which music analysis can be greatly expedited through automation in critical areas. This project also intends to expand the corpus of openly available (open-source) scores and to create new tools for analyzing them.
SALAMI (Structural Analysis of Large Amounts of Music Information) is an innovative and ambitious computational musicology project, funded as a 2009 Digging Into Data Challenge project. To date, musical analysis has been conducted by individuals and on a small scale. Our computational approach, combined with the huge volume of data now available from such sources as the Internet Archive, will a) deliver a substantial corpus of musical analyses in a common framework for use by music scholars, students, and beyond; and b) establish a methodology and tooling that will enable others to add to this corpus in the future and to broaden the application of the techniques we establish. A resource of SALAMI's magnitude empowers musicologists to approach their work in a new and different way, starting with the data, and to ask research questions that have not been possible before.

The SALAMI project created several data sets that will have long-term broader impacts in the fields of computational musicology, music information retrieval, music informatics, machine learning, computer science, and electrical engineering. Both senior researchers and their students are using the new SALAMI data sets.

The first data set was created by trained musicology students who analysed several thousand pieces of music audio. These talented students created what is known as "ground-truth" data; that is, they used their musical skills to determine where each section in the music began, ended, and, where appropriate, repeated. These ground-truth files are now being used to test the accuracy of music segmentation algorithms created by student and senior researchers, and the data set is used to formally evaluate submitted algorithms in the annual Music Information Retrieval Evaluation eXchange (MIREX) Audio Structure Segmentation task (a sketch of the boundary-matching idea behind this evaluation appears below). For more information on MIREX, see: http://nema.lis.illinois.edu/nema_out/mirex2012/results/struct/sal/. One finding of intellectual merit drawn from the ground-truth data is that human experts agree on their segmentation decisions roughly 72% of the time. This tells us that the best automated approaches, which score approximately 57%, need about 15 percentage points of improvement to match human performance.

The second data set is the output of a massive run of segmentation algorithms against roughly 23,000 hours (about 250,000 tracks) of music audio. Seven different community-created algorithms were run, each against the whole music collection. The derived segmentation data is now available to the research community as sets of text-based segmentation files. These files are available via a Linked Open Data endpoint that can be queried with SPARQL (see the query sketch below); subcollections are also available via web-based download, with sets organized by music source or by the algorithm used to generate the data. More information is available at: http://salami.lis.illinois.edu/.
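As an illustration of how the endpoint might be used, the sketch below queries it from Python with the SPARQLWrapper library. The endpoint URL and the deliberately generic triple pattern are assumptions for illustration only, not the project's documented schema; consult http://salami.lis.illinois.edu/ for the actual endpoint address and vocabulary.

    # A minimal sketch of querying the SALAMI Linked Data endpoint via SPARQL.
    # The endpoint URL below is hypothetical -- substitute the real address
    # from the project site. The query is deliberately generic (it fetches a
    # few triples to inspect), since the actual SALAMI RDF vocabulary is not
    # reproduced here.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://salami.lis.illinois.edu/sparql")  # hypothetical URL
    sparql.setQuery("""
        SELECT ?s ?p ?o
        WHERE { ?s ?p ?o . }
        LIMIT 10
    """)
    sparql.setReturnFormat(JSON)

    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["s"]["value"], row["p"]["value"], row["o"]["value"])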
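The segmentation accuracy figures cited above come from matching predicted section boundaries against ground-truth boundaries within a small tolerance window. The following is a minimal, self-contained sketch of that boundary-matching idea, assuming a 0.5-second window and a simple greedy match; it is an illustration, not the official MIREX evaluation code. For research use, the mir_eval package provides a vetted implementation (mir_eval.segment.detection).

    # Sketch: boundary precision/recall/F-measure with a tolerance window.
    # An estimated boundary counts as a hit if it lies within `window` seconds
    # of a not-yet-matched reference boundary (greedy matching for brevity).
    def boundary_f_measure(reference, estimated, window=0.5):
        matched = 0
        used = set()  # indices of reference boundaries already claimed
        for est in estimated:
            for i, ref in enumerate(reference):
                if i not in used and abs(est - ref) <= window:
                    matched += 1
                    used.add(i)
                    break
        precision = matched / len(estimated) if estimated else 0.0
        recall = matched / len(reference) if reference else 0.0
        if precision + recall == 0.0:
            return precision, recall, 0.0
        return precision, recall, 2 * precision * recall / (precision + recall)

    # Example: one annotator's boundaries vs. an algorithm's output (seconds).
    human = [0.0, 31.2, 62.8, 95.0, 120.4]
    algo = [0.0, 30.9, 60.1, 95.3, 110.0, 120.6]
    print(boundary_f_measure(human, algo))  # ~ (0.667, 0.800, 0.727)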