This project develops state-of-the-art machine learning methods to describe and quantify scientific research. The approach is to develop new topic models that can learn underlying research categories across a wide variety of text data sources, including NSF and NIH grant awards, scientific publications, and US patents. A key innovation is building technology that permits feedback from domain experts and end users. The approach is potentially transformative in that it may overcome known limitations of current topic modeling approaches by improving the quality and utility of topics across diverse data sources.

The web-based tool displays and manipulates learned topics so that users can apply them to create comprehensive overviews of scientific funding, research, and production, including answers to questions such as:
- What types of science are funded by NSF, NIH, and other agencies?
- What types of science are produced by funded investigators?
- What types of science are described in US patents?

The tool also provides users with answers to more complex questions about the science of science and the relationship between funding and scientific achievements, such as tracking trends, describing funding programs, and identifying funding overlap (across agencies, or even within agencies).

Intellectual Merit: The proposed research advances techniques and methods for tracking scientific research in several ways. First, it advances the development of unsupervised statistical topic models to categorize, describe, and measure scientific research. Second, it addresses known problems with topic modeling for this type of application, such as improving the coherence of all topics and making topics transcend different types of document collections (grants, publications, patents); unified topics across grants, publications, and patents let users measure the impacts of funding more directly. Third, the research develops evaluation frameworks that shift the focus from machine learning metrics to the needs of domain experts and end users.

Broader Impacts: This work has an array of broader impacts. It creates useful data for funding agency staff, researchers, the interested public, government bodies, the media, and other stakeholders. The web-based tool allows users to create custom data sets tailored to their particular needs, which they can use to answer an array of science of science policy questions. The knowledge created in this work supports initiatives such as STAR METRICS to document the value of investments in scientific research.

Project Report

This project developed machine learning methods to describe and quantify scientific research. We developed new topic models that learn underlying research categories across a wide variety of text data sources, including grant awards and scientific publications. We addressed known limitations of current topic modeling approaches and improved the clarity, coherence, quality, and utility of machine-learned topics across diverse data sources. We also developed a state-of-the-art method for extracting technical terminology and concepts from scientific literature. Our topic models were used to answer questions about the science of science and the relationship between funding and scientific achievements, to track scientific trends, and to describe funding programs. Improving the coherence of learned topics allowed users to identify categories of research more clearly. We created useful data for funding agency staff (in particular, the NSF topic model used in the NSF Portfolio Explorer), and our data sets allowed users to answer science of science policy questions and allowed analysts to more accurately describe and measure the value of investments in scientific research.

Using one decade of NSF award abstracts and five years of full-text proposals submitted to NSF (a combined collection of over 300,000 proposals and awards), we created a series of topic models to describe NSF research. Topic models were reviewed by NSF, and one topic model was selected for use. Because the learning of topics was fully automated, a key component of this work was incorporating feedback from domain experts and end users. We collected feedback from program directors on the utility of topics and the types of analyses needed, and we conducted a community topic labeling exercise in which each topic description was assigned a short 2-3 word label. Findings on this deliverable include validation from NSF that our automatically learned topics characterize the research well and that the research facets identified by the topic modeling were interesting and useful.

Another finding from our project was the ability of the topic model to identify different types or classes of topics. Of the 1,000 topics, approximately 800 are clear 'research' themes that are useful for describing the research content of a proposal. An additional 60-70 topics pertain to NSF administrative categories and/or language describing Broader Impacts; these topics are also useful for tracking the ways PIs write about different broader impacts criteria. Approximately 60-70 topics are very general descriptions of research and the mechanics of conducting research. The remaining 60-70 topics are unusable (e.g., lists of PI names or institution names). The overall response to the utility and usability of the learned topics was positive. We demonstrated that topics are a useful representation for showing changes in research focus over time, e.g., across NSF programs.

This EAGER award was a key precursor to ideas that went into a larger, successful three-year NSF SciSIP award entitled 'Balancing the Portfolio: Efficiency and Productivity of Federal Biomedical R&D Funding.' Together with economist Meg Blume-Kohout, we have continued to develop topic models and combine them with econometric models. Our cross-disciplinary, cross-institution collaborative research project combines economic analysis with methods from statistical machine learning to assess the relative efficiency and efficacy of research and development expenditures at NIH. Research outcomes under this award were broadly disseminated at various international conferences and in top-ranking journals.
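As a rough illustration of the kind of model described in the report above, the sketch below fits a small LDA topic model with gensim and prints the top words per topic, the sort of output that annotators could then label with a short 2-3 word description. The toolkit choice, the placeholder abstracts corpus, the preprocessing, and the toy topic count are assumptions for illustration only; the project's actual 1,000-topic models and pipeline are not reproduced here.

    # Illustrative sketch only (not the project's actual pipeline): fit an
    # LDA topic model over award abstracts with gensim and inspect the top
    # words per topic.
    from gensim import corpora, models
    from gensim.utils import simple_preprocess

    # Placeholder corpus; the project used over 300,000 NSF proposals and awards.
    abstracts = [
        "statistical topic models for large collections of scientific text",
        "gene regulation and expression in model organisms",
        "wireless sensor networks for environmental monitoring",
        "graduate student training and broader impacts activities",
    ]

    # Tokenize and build the bag-of-words representation.
    tokenized = [simple_preprocess(doc) for doc in abstracts]
    dictionary = corpora.Dictionary(tokenized)
    bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized]

    # Fit a small LDA model; the report describes a 1,000-topic model over NSF text.
    lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=4, passes=20)

    # Show the top words for each learned topic, ready for manual labeling.
    for topic_id, words in lda.show_topics(num_topics=4, num_words=6, formatted=False):
        print(topic_id, [word for word, _ in words])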

Agency: National Science Foundation (NSF)
Institute: SBE Office of Multidisciplinary Activities (SMA)
Type: Standard Grant (Standard)
Application #: 1106434
Program Officer: Joshua Rosenbloom
Project Start:
Project End:
Budget Start: 2011-02-15
Budget End: 2013-01-31
Support Year:
Fiscal Year: 2011
Total Cost: $162,774
Indirect Cost:
Name: University of California Irvine
Department:
Type:
DUNS #:
City: Irvine
State: CA
Country: United States
Zip Code: 92697