High-throughput sequencing (HTS) data on the genomes of a diverse range of species are being produced at an unprecedented rate. However, the development of computational and statistical approaches for handling these data lags behind, creating a gap between the massive data being generated and the biological knowledge that could be gleaned. Here we propose to develop an integrated system for genetic variation detection, annotation and analysis for HTS data, thereby reducing the critical gap faced by the community.
In Aim 1, we will develop a hidden Markov model (HMM) based computational algorithm that incorporates multiple sources of information, including sequencing depth, allelic dosage, population allele frequency and paired-end read distance, for reliable yet efficient detection of copy number variations (CNVs). Given a large list of SNPs, indels and CNVs, researchers then face the challenge of identifying the subset of functionally important variants.
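To illustrate the kind of HMM-based CNV detection described in Aim 1, the following is a minimal sketch, not the proposed algorithm. It uses only one of the listed information sources (normalized sequencing depth) and decodes the most likely copy-number state per marker with the Viterbi algorithm. The three states, the Gaussian emission means and standard deviation, the transition and initial probabilities, and the function name `viterbi_cnv` are all hypothetical choices made for illustration.

```python
import numpy as np

# All parameters below are illustrative assumptions, not values from the proposal.
STATES = ("deletion", "normal", "duplication")   # copy number 1, 2, 3
MEANS = np.array([0.5, 1.0, 1.5])                # expected depth ratio per state
SD = 0.15                                        # shared emission standard deviation
STAY = 0.98                                      # probability of remaining in a state
INIT = np.array([0.05, 0.90, 0.05])              # initial state probabilities


def viterbi_cnv(depth_ratio):
    """Return the most likely state label for each marker.

    depth_ratio: sequence of read depths normalized so that copy number 2 is 1.0.
    """
    obs = np.asarray(depth_ratio, dtype=float)
    n_obs, n_states = obs.size, len(STATES)
    if n_obs == 0:
        return []

    # Gaussian log-likelihood of each observation under each state.
    # The normalizing constant is dropped because SD is shared across states.
    log_emit = -0.5 * ((obs[:, None] - MEANS[None, :]) / SD) ** 2

    # Transition matrix in log space: high probability of staying put.
    log_trans = np.full((n_states, n_states),
                        np.log((1.0 - STAY) / (n_states - 1)))
    np.fill_diagonal(log_trans, np.log(STAY))

    score = np.empty((n_obs, n_states))
    back = np.zeros((n_obs, n_states), dtype=int)
    score[0] = np.log(INIT) + log_emit[0]

    for t in range(1, n_obs):
        # cand[i, j] = best score ending in state i at t-1, then moving to j.
        cand = score[t - 1][:, None] + log_trans
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[t]

    # Trace back the best path from the final marker.
    path = np.empty(n_obs, dtype=int)
    path[-1] = score[-1].argmax()
    for t in range(n_obs - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]

    return [STATES[s] for s in path]


if __name__ == "__main__":
    depth = [1.0] * 5 + [0.5] * 5 + [1.0] * 5
    print(viterbi_cnv(depth))
```

With these assumed parameters, a sustained run of low-depth markers is called as a deletion, while a single low-depth marker is absorbed into the surrounding normal state because the transition penalty outweighs one marker's evidence. A fuller model along the lines of Aim 1 would extend the emission term with allelic dosage, population allele frequency and paired-end read distance.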
In Aim 2, we will develop a comprehensive functional annotation pipeline to annotate the functional importance of coding and non-coding variants, utilizing database information from many large-scale genomics projects, and generate a "functional vector" for each variant. These functional vectors can help biologists interpret sequencing results and help statistical geneticists develop informed association tests using sequencing data. Appropriate statistical methods are also needed to analyze population-level sequencing data in order to identify genomic variants that may contribute to disease susceptibility or phenotypic variability.
In Aim 3, we will develop a hierarchical modeling strategy, which utilizes functional vector information for each variant, to perform association tests on genes, genomic regions, or biological pathways, such as ontology categories and gene regulatory/metabolic pathways. Finally, in Aim 4, we will test the properties of each approach via simulation and real data analysis, and develop, distribute and support freely available software packages implementing the proposed methods. We believe that well-documented and supported software implementations will allow other researchers to extract the maximum information from the methodological and scientific advances that result from this project. Successful completion of the aims will enable researchers to fully investigate the massive amounts of sequencing data that have been or will be generated, thus contributing to our understanding of how genetic variants influence phenotypic variability.
Despite the rapid advancement of high-throughput sequencing (HTS) techniques, the development of computational and statistical approaches for handling these data lags behind, creating a gap between the massive data being generated and the biological knowledge that could be gleaned. Here we propose to develop an integrated system to detect variants, annotate them and analyze them for genotype-phenotype associations. Successful completion of the aims will enable researchers to fully investigate the massive amounts of sequencing data that have been or will be generated, thus contributing to our understanding of how genetic variants influence phenotypic variability.