There is a crisis of reproducibility and replicability of scientific results. This crisis is an increasing source of concern in both the scientific and popular press. The crisis is so acute that the United States Congress is currently investigating the reproducibility of the scientific process. At the heart of the crisis is a shortage of data analytic skill throughout the scientific enterprise. There is an emerging consensus that the best way to address the crisis is to increase data analytic training, particularly around reproducibility and replicability. In this application we (1) propose the first formal statistical model for reproducibility and replicability and then use data and experiments from the largest massive online open program in data science in the world to (2) perform randomized studies to improve our knowledge about which statistical methods and protocols lead to increased reproducibility and replicability in the hands of average users and (3) analyze learner, course, and content characteristics that increase learner success and throughput, in order to increase the number of trained data analysts worldwide. To accomplish goals (2) and (3) we will use the largest and highest-throughput data science program in the world: the Johns Hopkins Data Science Specialization. This specialization, developed by the investigators of this project, consists of nine courses that are offered every month. Since the launch of this program in April 2014, these classes have seen more than two million enrollments, and nearly all of the learners' experiences have been recorded as data. Furthermore, the MOOC platform for this series permits random assignment of quiz questions and content. We will disseminate our results through open source software, analysis protocols, our popular blog, and the Data Science Specialization to maximally improve data science training and reduce the scientific replication and reproducibility problem.
The size of this program means that by increasing the quality of the program and the number of students who complete it by even a small percentage, we can affect global data analytic behavior.

Public Health Relevance

Many scientific results cannot be replicated or reproduced. One reason for this crisis is a shortage in the quantity and quality of trained data analysts across all medical and scientific areas. We propose to define a formal statistical model for reproducibility and replicability, then use the world's largest data science program to identify statistical methods and data analyst characteristics that improve scientific reproducibility and replication.

Agency
National Institutes of Health (NIH)
Institute
National Institute of General Medical Sciences (NIGMS)
Type
Research Project (R01)
Project #
1R01GM115440-01A1
Application #
9100338
Study Section
Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer
Marcus, Stephen
Project Start
2016-04-01
Project End
2020-03-31
Budget Start
2016-04-01
Budget End
2017-03-31
Support Year
1
Fiscal Year
2016
Total Cost
Indirect Cost
Name
Johns Hopkins University
Department
Biostatistics & Other Math Sci
Type
Schools of Public Health
DUNS #
001910777
City
Baltimore
State
MD
Country
United States
Zip Code
21205
Patil, Prasad; Peng, Roger D; Leek, Jeffrey T (2016) What Should Researchers Expect When They Replicate Studies? A Statistical View of Replicability in Psychological Science. Perspect Psychol Sci 11:539-44