Complex human diseases and related quantitative traits result from the interplay of many risk factors, including genetic and environmental components. Gene-environment interaction studies provide a general framework for identifying genetic variants that modify environmental, physiological, lifestyle, or treatment effects, as well as those contributing to age, sex, and racial/ethnic disparities in complex traits. Moreover, genetic association studies that account for gene-environment interactions can enhance our understanding of the genetic architecture of complex diseases by allowing for different genetic effects in different exposure strata. With recent advances in technology and falling costs, genetic and genomic data are being generated on very large scales. However, commonly used statistical software programs for gene-environment interaction studies were generally developed many years ago, and their computational algorithms have not been optimized to analyze hundreds of thousands to millions of samples from possibly complex study designs. To fill the gap between current and future analytical needs in large-scale gene-environment interaction studies and the analytical solutions currently available, we plan to (Aim 1) develop efficient algorithms for common variant gene-environment interaction analyses that scale linearly with the sample size;
(Aim 2) develop new statistical tests for rare variant gene-environment interaction analyses in the mixed effects model framework for correlated samples;
and (Aim 3) implement the proposed statistical methods and computational algorithms in new open-source software programs.
Aim 1 addresses current computational challenges in conducting gene-environment interaction studies in up to millions of samples.
In Aim 2, we plan to solve statistical and computational challenges in gene-environment interaction analyses of large-scale whole genome sequencing data, accounting for relatedness, complex study designs, and model misspecification.
Aim 3 focuses on software development: we will deliver well-documented and user-friendly software packages and analysis pipelines for large-scale gene-environment interaction studies. The methods and software programs will be applied to ongoing whole genome sequencing projects, as well as biobank-scale data. They will significantly facilitate the use of large-scale genetic and genomic data for gene-environment interaction studies in upcoming years, helping to better understand the genetic basis of complex cardio-metabolic, lung, blood, and sleep diseases and their age, sex, and racial/ethnic disparities, and to promote personalized disease prevention and treatment strategies in precision health research.

Public Health Relevance

Gene-environment interactions play an important role in complex disease etiology. We propose to develop efficient statistical methods and computational algorithms for large-scale gene-environment interaction studies, and to implement them in open-source software programs and cloud-based analysis pipelines. These tools will facilitate gene-environment interaction research on complex cardio-metabolic, lung, blood, and sleep diseases and related conditions using hundreds of thousands to millions of samples.

National Institutes of Health (NIH)
National Heart, Lung, and Blood Institute (NHLBI)
Research Project (R01)
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Papanicolaou, George
University of Texas Health Science Center Houston
Schools of Public Health
United States