This project will develop and evaluate an innovative research tool, based on Natural Language Processing (NLP) and Machine Learning (ML), to support qualitative social science research, specifically content analysis. Content analysis is a qualitative research technique for finding evidence of concepts of theoretical interest using text rather than numbers as its raw data. The process of identifying and labeling significant features in text is referred to as "coding," and the result of such an analysis is a text annotated with codes for the concepts exhibited. This technique has become increasingly popular and more applicable as the volume of available "born-digital" text has exploded. However, the reliance on manual analysis of the text limits the scale and scope of content analysis research.

In this project, the problem of coding qualitative data is conceptualized as an information extraction problem amenable to automation using NLP. However, rather than seeking to automate the process, the technologies will be used in a supporting role, creating a human-computer partnership. ML will be used to induce NLP rules from examples of coded text, avoiding the need to develop rules manually. To reduce the amount of training data needed from the human participants, an active learning process will be employed, in which a few hand-coded examples are used to create an initial model that can be further evolved through interaction with the user. These approaches will be combined in a prototype tool to support qualitative content analysis. As a demonstration and test of the tool, it will be applied to current and novel studies of cyber-infrastructure-supported distributed groups, specifically free/libre open source software development teams, and then to a broad range of social science research problems. This broad usage will also provide a test of the generalizability of a socio-computational approach to this problem.
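The active learning process described above can be illustrated with a minimal sketch. It is not the project's implementation: the abstract does not name a learning algorithm or a query strategy, so this example assumes a tiny Naive Bayes classifier as a stand-in for the induced NLP rules and assumes uncertainty sampling as the way to choose which segment the human codes next. All names (`NaiveBayesCoder`, `most_uncertain`, `active_learning`, `ask_human`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: whitespace tokenization is enough for this illustration.
    return text.lower().split()


class NaiveBayesCoder:
    """Tiny multinomial Naive Bayes model that assigns one code per text segment.

    A stand-in for the learned coding model; the abstract does not specify one.
    """

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # code -> word -> count
        self.doc_counts = Counter()              # code -> number of examples
        self.vocab = set()

    def add_example(self, text, code):
        self.doc_counts[code] += 1
        for word in tokenize(text):
            self.word_counts[code][word] += 1
            self.vocab.add(word)

    def predict_proba(self, text):
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for code in self.doc_counts:
            total_words = sum(self.word_counts[code].values())
            score = math.log(self.doc_counts[code] / total_docs)
            for word in tokenize(text):
                # Add-one smoothing, with one extra slot for unseen words.
                count = self.word_counts[code][word] + 1
                score += math.log(count / (total_words + len(self.vocab) + 1))
            log_scores[code] = score
        # Convert log scores to probabilities that sum to 1.
        top = max(log_scores.values())
        exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(exp_scores.values())
        return {c: v / norm for c, v in exp_scores.items()}


def most_uncertain(model, pool):
    """Return the uncoded segment whose best predicted code is least probable."""
    return min(pool, key=lambda t: max(model.predict_proba(t).values()))


def active_learning(seed, pool, ask_human, rounds):
    """Train on a few hand-coded examples, then query the human round by round."""
    model = NaiveBayesCoder()
    for text, code in seed:
        model.add_example(text, code)
    pool = list(pool)
    for _ in range(min(rounds, len(pool))):
        query = most_uncertain(model, pool)
        pool.remove(query)
        model.add_example(query, ask_human(query))  # human supplies the code
    return model, pool
```

In this sketch, `seed` is a short list of `(text, code)` pairs coded by hand, `pool` holds uncoded segments, and `ask_human` is any callable that returns the researcher's code for a segment. Each round asks the human about the segment the current model is least sure of, which is one common way to reduce the amount of hand-coded training data.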

The intellectual merit of the research is four-fold. First, the project seeks to develop a novel socio-computational system that supports a human-computer partnership through the integration of information extraction and active learning. Second, a validation study will apply the tool to a diverse set of codes, providing evidence of the generality and limits of a socio-computational approach. Third, the demonstration studies using the tool will contribute to research on distributed groups. Finally, the project addresses a fundamental methodological problem in the broad domain of qualitative research, namely dealing with large quantities of unstructured qualitative data, by applying innovative computer support. By avoiding the need for hand-written rules and reducing the required amount of hand-annotated training data, this partnership will make it practical to use a system for coding large quantities of qualitative data in various domains.

The project has numerous broader impacts. It will benefit society by providing useful infrastructure for research, both in the form of a content analysis tool for scientific research and in the form of corpora of annotated data for use in future Natural Language Processing research. The demonstration studies will provide generalizable knowledge to improve the effectiveness of distributed groups, an increasingly important mode of organization. Finally, the project will contribute to education and training, particularly of women and members of minority groups.

National Science Foundation (NSF)
Division of Information and Intelligent Systems (IIS)
Program Officer
Tatiana D. Korelsky
Syracuse University
United States