Core B: Data Acquisition and Construction

Our projects' data requirements overlap extensively, and an Aim of this program project is to provide data to catalyze work on aging and innovation and person-based studies of innovation in the broader research community. The data necessary to study these problems are currently scattered across sources and formats and have not been linked, posing a formidable barrier to research. The Data Acquisition and Construction Core will develop, maintain, and distribute a number of integrated, large-scale datasets and tools that will provide infrastructure for the project. These will also be provided freely, in a user-friendly form and with support, to the scholarly research community (including graduate students and researchers at non-profits and government agencies) in perpetuity. Generating this infrastructure centrally will ensure that it is fully integrated; minimize duplication of effort; ensure quality and uniformity; take greatest advantage of the expertise of program participants; and establish a common set of methods for all users. The availability of this data infrastructure and established procedures will support a dynamic field studying aging and innovation and person-based studies of innovation. A central component of our work will be the construction of a large-scale, disambiguated, individual-level, longitudinal database on biomedical researchers comprising: (1) publications, (2) patents, (3) grants, (4) citations, (5) biographic data, (6) research institution characteristics and quality rankings, and (7) journal quality.
We will also develop: (1) a longitudinal dataset on research areas, including research effort, drug approvals, and health outcomes, which can stand alone and will also be combined with the individual-level dataset; (2) a set of data extraction and manipulation tools that will facilitate the use of these datasets; (3) estimates of the health and economic impacts of biomedical research; and (4) metrics to identify high-impact and transformative research. The project draws together a team with complementary skills that is uniquely suited to perform this work, along with a sophisticated group of end-users who can refine the data, add complementary components, and maximize usability.

Public Health Relevance

The US is increasingly emphasizing innovation, but the aging of our scientific workforce is expected to reduce innovative output. This Core will develop the data infrastructure to support both our work and future work that will provide policy-relevant information about how the aging of our scientific workforce will affect our biomedical innovative output, the associated health and economic consequences, and policy responses.

National Institutes of Health (NIH)
National Institute on Aging (NIA)
Research Program Projects (P01)
Study Section
Special Emphasis Panel (ZAG1-ZIJ-9 (04))
National Bureau of Economic Research
United States
Prosperi, Mattia; Buchan, Iain; Fanti, Iuri et al. (2016) Kin of coauthorship in five decades of health science literature. Proc Natl Acad Sci U S A 113:8957-62
Mishra, Shubhanshu; Torvik, Vetle I (2016) Quantifying Conceptual Novelty in the Biomedical Literature. Dlib Mag 22:
Buffington, Catherine; Harris, Benjamin Cerf; Jones, Christina et al. (2016) STEM Training and Early Career Outcomes of Female and Male Graduate Students: Evidence from UMETRICS Data linked to the 2010 Census. Am Econ Rev 106:333-338
Knepper, Richard; Börner, Katy (2016) Comparing the Consumption of CPU Hours with Scientific Output for the Extreme Science and Engineering Discovery Environment (XSEDE). PLoS One 11:e0157628
Smalheiser, Neil R; Shao, Weixiang; Yu, Philip S (2015) Nuggets: findings shared in multiple clinical case reports. J Med Libr Assoc 103:171-6
Torvik, Vetle I (2015) MapAffil: A Bibliographic Tool for Mapping Author Affiliation Strings to Cities and Their Geocodes Worldwide. Dlib Mag 21:
Smalheiser, Neil R; Gomes, Octavio L A (2015) Mammalian Argonaute-DNA binding? Biol Direct 10:27
Zolas, Nikolas; Goldschlag, Nathan; Jarmin, Ron et al. (2015) Wrapping it up in a person: Examining employment and earnings outcomes for Ph.D. recipients. Science 350:1367-71
Shao, Weixiang; Adams, Clive E; Cohen, Aaron M et al. (2015) Aggregator: a machine learning approach to identifying MEDLINE articles that derive from the same underlying clinical trial. Methods 74:65-70
Cohen, Aaron M; Smalheiser, Neil R; McDonagh, Marian S et al. (2015) Automated confidence ranked classification of randomized controlled trial articles: an aid to evidence-based medicine. J Am Med Inform Assoc 22:707-17
