New sequencing technologies have made genomics a big data science. These data are complex and represent many variables. To extract biological information from genomic sequences, it is often necessary to reduce this complexity. A number of computational approaches exist for doing so, but they often introduce errors because of the assumptions they make about the data. This project will lead to the development of novel approaches tailored to the type of genomic data collected. One of these data types represents the DNA sequence itself, and the other comes from natural modifications to the sequence that occur when genes are expressed. The new methods will identify important differences in the two data types more accurately by correctly modeling the unique properties of these data in a statistical framework. Methods developed during this project will have a great impact on the genomics field, where researchers may discover the genetic basis of complex diseases. The broader impacts of this project are gaining a deeper insight into the genetic basis of complex diseases, distributing the novel methods through public webservers and software tools for academic research and educational purposes, and training undergraduate students, graduate students, and postdoctoral scholars. In particular, this project will provide training to underrepresented groups through a summer intensive program that recruits minorities traditionally underrepresented in STEM fields.
Discovering a low dimensional structure in high dimensional genomic data is an important procedure in genomic studies because this structure may reveal unknown confounding factors as well as other important properties of the data, such as the ethnicity of individuals. Although several dimensionality reduction methods are widely used in genomics, they may not recover an accurate low dimensional structure from genomic data because the assumptions underlying their statistical models are often violated in these data. This project proposes to develop dimensionality reduction methods tailored to genomic data, especially methylation and genotype data. These methods will incorporate unique properties of genomic data, such as the discrete nature and correlation structure of genotype data and the differing methylation patterns across cell types and tissues. The project will also analyze the asymptotic behavior of the novel methods using random matrix theory. Three strategies will be used to validate the methods. First, for all of the genomics applications there are datasets with gold standard information that can serve as ground truth. Second, simulated data based on current practices in the genomics community will be used to evaluate the genomics applications. For example, it is standard in the community to simulate the genetics of admixed individuals by combining the genotypes of individuals of known ancestry from a reference dataset such as the 1000 Genomes Project; a sketch of this kind of simulation-based evaluation is given below. Third, the team will evaluate the general algorithms by generating simulated data from various generative models to verify that the algorithms exhibit the expected asymptotic behavior and to examine how they perform when their assumptions are violated. The methods will contribute both to the statistical field, by improving current dimensionality reduction methods, and to the genomics field, through the release of software tools. The broader impacts of this project are gaining a deeper insight into the genetic basis of complex diseases, distributing the methods through public webservers and software tools for academic research and educational purposes, and training undergraduate students, graduate students, and postdoctoral scholars. In particular, this project will provide training to underrepresented groups through a summer intensive program that recruits minorities traditionally underrepresented in STEM fields.
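To make the simulation-based evaluation strategy concrete, the following minimal sketch (Python with NumPy) illustrates one common form of it: genotypes for two ancestral populations are drawn from population-specific allele frequencies, admixed individuals are formed as mixtures of the two, and a standard PCA is applied to the combined genotype matrix. The allele frequencies, sample sizes, and admixture model here are illustrative assumptions rather than the project's actual pipeline, which would draw on real reference genotypes such as those from the 1000 Genomes Project; the PCA step stands in for the kind of baseline whose continuity and Gaussian assumptions the proposed methods aim to relax for discrete, correlated genotype data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal sketch: simulate 0/1/2 genotypes for two ancestral populations from
# population-specific allele frequencies (hypothetical values standing in for
# a real reference panel), form admixed individuals as mixtures of the two,
# and run a standard PCA on the combined matrix.

n_snps, n_ref, n_admixed = 5000, 100, 100

# Assumed population-specific allele frequencies (illustration only)
freq_a = rng.uniform(0.05, 0.95, n_snps)
freq_b = np.clip(freq_a + rng.normal(0, 0.1, n_snps), 0.05, 0.95)

def simulate_genotypes(freqs, n_ind):
    """Draw 0/1/2 genotypes as Binomial(2, p) per SNP (Hardy-Weinberg assumption)."""
    return rng.binomial(2, freqs, size=(n_ind, len(freqs)))

pop_a = simulate_genotypes(freq_a, n_ref)
pop_b = simulate_genotypes(freq_b, n_ref)

# Admixed individuals: each gets an admixture proportion alpha, and each SNP's
# allele frequency is the corresponding mixture of the two populations.
alpha = rng.uniform(0, 1, n_admixed)
admixed = np.vstack([
    simulate_genotypes(a * freq_a + (1 - a) * freq_b, 1) for a in alpha
])

# Standard PCA on the column-standardized genotype matrix, the usual baseline
# for inferring population structure from genotype data.
G = np.vstack([pop_a, pop_b, admixed]).astype(float)
G -= G.mean(axis=0)
sd = G.std(axis=0)
G /= np.where(sd > 0, sd, 1.0)

U, S, Vt = np.linalg.svd(G, full_matrices=False)
pcs = U[:, :2] * S[:2]  # top two principal components per individual
print(pcs.shape)
```

In a setup like this, the known admixture proportions used to generate the data provide a gold standard against which the low dimensional structure inferred by a new method can be benchmarked, alongside the standard PCA baseline shown above.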