Recent years have witnessed the development of large research projects that involve genotyping hundreds of thousands of individuals for whom detailed medical records are available. Examples include the All of Us Research Program, the Million Veteran Program, and the UK Biobank resource. Often, whole-genome sequencing data is also available for a substantial fraction of the individuals. These large samples, with their precise genotypic and phenotypic information, give us the opportunity to bring our understanding of the relations between genetic variation and traits of medical interest to the next level. While the small sample sizes initially available for genome-wide association studies (GWAS) motivated analyses that were approximate in nature, we are now in a position to probe more closely the causal genetic mechanisms underlying medically relevant phenotypes. We can aspire to distinguish variants that have causal effects from those that are associated only because of linkage disequilibrium or population structure. Indeed, we need to pay even greater attention to the implications of hidden confounders: even small effects become statistically significant when sample sizes are large enough. Increasing the resolution with which we can describe causal mechanisms will result in the identification of clearer targets for drug development. It will also improve the precision of personalized risk evaluations based on genotypes: if we can construct risk scores using variants that are truly causal, their performance will remain robust across ethnicities and environmental exposures. To zoom in on genetic variants with causal effects, this project will leverage a set of new statistical methodologies that the investigators have recently introduced. These new approaches are remarkably flexible, in that they do not rely on specific assumptions about how phenotypes are linked to genetic variants.
Indeed, they allow researchers to capitalize on powerful machine learning algorithms and, crucially, equip their results with precise replicability guarantees. We have assembled a diverse and complementary team, including experts in statistical genomics, methodological statistics and computer science, with a strong record in both software development and genetic data analysis. A postdoctoral scholar and two graduate students will contribute to the research program, and the interdisciplinary training they will acquire in statistics, computation and genetics is a further substantial benefit.

Public Health Relevance

Personalized medicine strives to provide treatments that are fine-tuned to patients, taking into account both the specifics of their diseases and the individuals' background characteristics. To fulfill this promise, identifying the genes and mutations that influence medically relevant traits is an important tool: it gives us the opportunity to understand the biological pathways involved, inspires drug development, informs treatment and counseling, and facilitates prevention. The proposed research would develop new methods of statistical analysis to harness the information contained in large datasets, increasing our ability to distinguish the genetic mutations that directly impact phenotypes from those that are merely correlated with traits of interest.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
High Priority, Short Term Project Award (R56)
Study Section
Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer
Ramos, Erin
Stanford University
Social Sciences
Schools of Medicine
United States