With advancements in next-generation sequencing technologies, sequencing studies have become increasingly common in substance dependence (SD) research. These studies generate massive amounts of sequencing data and allow researchers to comprehensively investigate the role of a deep catalog of genetic variants in SD. Although ongoing sequencing studies hold great promise for uncovering novel variants that contribute to SD, the high-dimensional data, low-frequency variants, complex SD etiology, and heterogeneous SD phenotypes create tremendous analytic and computational challenges. Developing robust and powerful methods and computationally efficient software will address the challenges in SD sequencing data analysis and enhance our ability to identify new SD-related variants. The goals of this application are to develop new methods and software for designing and analyzing population-based and family-based sequencing data with single or multiple phenotypes, and to use them in collaborative research to investigate genetic variants and gene-gene/gene-environment (G-G/G-E) interactions associated with SD. Based on preliminary simulation results, our central hypothesis is that the proposed methods are more computationally efficient than existing methods and attain more robust and powerful performance across various types of phenotypes. The planned specific aims are to: 1) develop a new non-parametric method for the design and analysis of sequencing data with one or multiple SD phenotypes; 2) develop a Joint-U method for high-dimensional G-G/G-E interaction analysis with SD sequencing data; 3) develop a family-similarity-U method for family-based SD sequencing data analysis, accounting for population stratification and rare variants enriched in families; and 4) facilitate the use of the new methods through software development and collaboration.
The proposed research will be initiated by an early-stage new investigator (NIDA K01 awardee), who has assembled a team of scientists with expertise in statistical genetics, bioinformatics/software development, SD epidemiology, behavioral genetics, and clinical psychiatry. The successful completion of this project will address several important statistical and computational gaps in ongoing sequencing studies, and advance the methodology and software development for SD sequencing data analysis. The application of the new methods and software to large-scale SD sequencing datasets also holds promise for the discovery of new SD-associated variants and G-G/G-E interactions, which will ultimately lead to a better understanding of SD etiology, with resulting potential benefits for SD prevention and treatment.
The proposed research, led by a new, early-stage investigator, will develop computationally efficient and powerful statistical tools for large-scale sequencing data, and will use these tools to investigate genetic variants and gene-gene/gene-environment interactions associated with substance dependence (SD). If successful, the project will address computational and analytical challenges associated with massive sequencing data, and will provide a new statistical framework for high-dimensional data analysis. The application of these tools to multiple SD sequencing datasets through collaborative research also holds promise for the discovery of new SD-associated variants, which will ultimately lead to a better understanding of SD etiology.