The Human Genome Project and follow-on projects such as 1000 Genomes, GTEx, ENCODE, and TOPMed provide powerful resources to identify genes that influence human health and disease and variability in disease-related quantitative traits (QTs). Along with these resources have come increasingly efficient tools to genotype, sequence, and annotate the genome, and to support computation across these data. These resources and tools will be critical as we continue to explore the genetic basis of human disease and disease-related QTs. In this proposal, we describe statistical and computational problems that arise in human gene mapping, with a particular focus on sequence analysis, genotype imputation, and quality control. We describe statistical methods to address these problems and software tools and web services to facilitate their use. We will test the resulting methods, tools, and web services via computer simulation and analysis of data from complex trait genetics studies in which we are involved. Specifically, we will: (1) develop tools to detect and estimate DNA sample contamination that are agnostic to genetic ancestry; (2) develop a test for Hardy-Weinberg equilibrium of sequence-based or imputed genotypes that is valid in the presence of population structure and robust to sample contamination; (3) enable more accurate variant filtering and genotype calling from DNA sequence data in the presence of population structure and/or sample contamination; (4) develop methods to detect sample contamination in RNA- and epigenomic sequence data; (5) extend the Michigan Imputation Server (MIS) to increase the power of sequence-based association studies by supporting use of external controls from existing sequence data resources, augmenting an existing imputation reference panel with the investigator's sequenced samples, and checking for contamination; and (6) document, distribute, and support efficient software tools that implement these methods.
Under separate funding, we will apply the resulting methods to help understand the genetic basis of type 2 diabetes and related QTs, and of schizophrenia and bipolar disorder. Success in these aims will enable more rapid identification of variants that predispose to human disease and account for variability in disease-related QTs, and has the potential to lead to new insights into basic biology and disease etiology, identify novel therapies, improve targeting of therapies, assist in disease classification, and support more accurate disease risk prediction. The modest cost of statistical and computational methods development, and the impact of these methods across many studies, make our proposed research highly cost-effective.
Studies to localize and identify genetic variants that predispose to human diseases and influence the variability of disease-related quantitative traits have the potential to inform breakthrough strategies to develop new drugs, to develop genetic tests to stratify risk, and to enable more targeted approaches to disease prevention and treatment. Efficient statistical and computational methods and software tools are critical for the success of such studies.