This proposal responds to the urgent need for advances in data science so that the next generation of scientists has the skills necessary to leverage the unprecedented and ever-increasing quantity and speed of biomedical information. Big Data hold the promise of achieving new understandings of the mechanisms of health and disease, revolutionizing the biomedical sciences, making the grand challenge of Precision Medicine a reality, and paving the way for more effective policies and interventions at the community and population levels. These breakthroughs require highly trained researchers who are proficient in biomedical Big Data science and have advanced skills in collaborating effectively across traditional disciplinary boundaries. To meet these challenges, UC Berkeley proposes an innovative training program in Biomedical Big Data for advanced Ph.D. students. This training grant will support 6 trainees. We anticipate further extending the reach of our program by admitting up to 2 additional students on alternative support, thus benefiting 8 students per year. The 25 participating faculty have extensive experience with biomedical applications and expertise in biostatistics, causal inference, machine learning, the development of Big Data tools, and scalable computing. Together, they span 8 departments/programs: Biostatistics; Computational Biology; Computer Science; Epidemiology; Integrative Biology; Molecular & Cell Biology; Neuroscience; and Statistics. We will recruit participants from among Ph.D. students in their second or third year of study in any of these departments. Those accepted into the program will participate in an intensive year of training courses, seminars, and workshops, beginning with introductory seminars in late summer and ending with a capstone project by each participant in the spring. Each trainee will be assigned a secondary thesis advisor whose biomedical Big Data science expertise complements that of the primary thesis advisor.
Specialized training will focus on three pillars: (1) translation of biomedical and experimental knowledge and scientific questions of interest into formal, realistic problems of causal and statistical estimation; (2) scalable Big Data computing; and (3) targeted machine learning with causal and statistical inference. Activities will include courses in machine learning, targeted learning, statistical programming, and Big Data computing, as well as workshops led by the Berkeley Data Science Institute, Statistical Computing Facility, and Berkeley Research Computing. The capstone course will center on a collaborative project in biomedical science that requires trainees to apply, in an integrated way, the skills acquired in the three foundational areas. Trainees will also benefit from group seminars, retreats, and interdisciplinary meetings that build a core identity with the cadre and the program. This proposal dovetails with several data science and precision medicine initiatives at UC Berkeley and comes at an ideal time to influence how data science is taught to all graduate students focusing on biomedical research across campus.

Public Health Relevance

Big Data are revolutionizing research in human health and medicine, from the design of observational and experimental studies to their analysis. Our Biomedical Big Data Training Program will train the next generation of data scientists in biomedicine with a rigorous education in the translation of real-world problems into realistic causal and statistical estimation problems, computer science, targeted machine learning, and statistical inference.

National Institutes of Health (NIH)
National Library of Medicine (NLM)
Institutional National Research Service Award (T32)
Study Section: Special Emphasis Panel (ZRG1)
Program Officer: Ye, Jane
University of California Berkeley
Schools of Public Health
United States
Mok, Amanda; Rhead, Brooke; Holingue, Calliope et al. (2018) Hypomethylation of CYP2E1 and DUSP22 Promoters Associated With Disease Activity and Erosive Disease Among Rheumatoid Arthritis Patients. Arthritis Rheumatol 70:528-536
Basu, Sumanta; Kumbier, Karl; Brown, James B et al. (2018) Iterative random forests to discover predictive and stable high-order interactions. Proc Natl Acad Sci U S A 115:1943-1948