DNA sequencing has become an indispensable tool in many areas of biology and medicine. Recent technological breakthroughs in next-generation sequencing (NGS) have made it possible to sequence billions of bases quickly and cheaply. A number of NGS-based tools have been created, including ChIP-seq, RNA-seq, Methyl-seq and exon/whole-genome sequencing, enabling a fundamentally new way of studying diseases, genomes and epigenomes. The widespread use of NGS-based methods calls for better and more efficient tools for analyzing and interpreting high-throughput NGS data. Although a number of computational tools have been developed, they are inadequate for mapping and studying genome features located within repeat, duplicated and other so-called unmappable regions of genomes. In this project, computational algorithms and software that expand the genomic accessibility of NGS to these previously understudied regions will be developed. The algorithms will begin with a new way of mapping raw NGS reads to the reference genome, followed by a machine learning method to resolve ambiguously mapped reads, and will be integrated into a comprehensive analysis pipeline for ChIP-seq. More specifically, the three aims of the research are to develop: (1) Data structures and efficient algorithms for read mapping to rapidly identify all mapping locations. Unlike existing methods, the focus of this research is to rapidly identify all candidate locations of each read, instead of one or only a few locations. (2) Machine learning algorithms for read analysis to resolve ambiguously mapped reads for both ChIP-seq analysis and genetic variation detection. This work will develop probabilistic models to resolve ambiguously mapped reads by pooling information from the entire collection of reads. (3) A comprehensive ChIP-seq analysis pipeline to systematically study genomic features located within unmappable regions of genomes.
These algorithms will be tested and refined using both publicly available data and data from established wet-lab collaborators. In addition to discovering new genomic features located within repeat, duplicated or other previously inaccessible regions, this work will provide the NGS community with (a) a faster and more accurate tool for mapping short sequence reads, (b) a general methodology for expanding the genomic accessibility of NGS, (c) a versatile, modular, open-source toolbox of algorithms for NGS data analysis, and (d) a comprehensive analysis of protein-DNA interactions in repeat regions in all publicly available ChIP-seq datasets. This work is a close collaboration between computer scientists and wet-lab biologists who are developing NGS assays to study biomedical problems. In particular, we will collaborate with Timothy Osborne of Sanford-Burnham Medical Research Institute to study regulators involved in cholesterol and fatty acid metabolism, with Kyoko Yokomori of UC Irvine to study Cohesin, Nipbl and their roles in Cornelia de Lange syndrome, and with Ken Cho of UC Irvine to study the roles of FoxH1 and Schnurri in development and growth control.
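The idea in aim (2) of resolving ambiguously mapped reads "by pooling information from the entire collection of reads" can be sketched with a simple expectation-maximization (EM) loop. This is a minimal illustration under strong assumptions, not the probabilistic model the project proposes: every candidate location of a read is treated as an equally good alignment, each locus has a single abundance parameter, and the function name `em_allocate` and its inputs are hypothetical.

```python
def em_allocate(read_candidates, n_iter=50):
    """Fractionally assign multi-mapped reads to loci with EM.

    read_candidates: list with one non-empty list of candidate loci per
    read (hypothetical input format).
    Returns (theta, responsibilities): theta[locus] is the estimated share
    of reads originating from that locus, and responsibilities[i][locus]
    is the posterior probability that read i came from that locus.
    """
    loci = sorted({locus for cands in read_candidates for locus in cands})
    theta = {locus: 1.0 / len(loci) for locus in loci}  # uniform start
    n_reads = len(read_candidates)

    for _ in range(n_iter):
        # E-step: split each read across its candidates in proportion
        # to the current locus abundances.
        counts = {locus: 0.0 for locus in loci}
        for cands in read_candidates:
            total = sum(theta[locus] for locus in cands)
            for locus in cands:
                counts[locus] += theta[locus] / total
        # M-step: abundances become the normalized expected counts.
        theta = {locus: counts[locus] / n_reads for locus in loci}

    responsibilities = []
    for cands in read_candidates:
        total = sum(theta[locus] for locus in cands)
        responsibilities.append(
            {locus: theta[locus] / total for locus in cands}
        )
    return theta, responsibilities


if __name__ == "__main__":
    # Three reads unique to locus A, one unique to B, two ambiguous.
    reads = [["A"], ["A"], ["A"], ["B"], ["A", "B"], ["A", "B"]]
    theta, resp = em_allocate(reads)
    print(theta)     # abundance estimate per locus
    print(resp[-1])  # how the last ambiguous read is split
```

In this toy input the uniquely mapped reads favor locus A, so the ambiguous reads are weighted toward A as well; that is the sense in which evidence is pooled across reads. A realistic model would additionally weight candidates by alignment quality and by local signal, such as ChIP-seq enrichment.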

Public Health Relevance

DNA sequencing has become an indispensable tool for basic biomedical research, for discovering new treatments, and for helping biomedical researchers understand disease mechanisms. Next-generation sequencing, which enables rapid generation of billions of bases at relatively low cost, poses a significant computational challenge: how to analyze the large amount of sequence data efficiently and accurately. The goal of this research is to develop open-source software that improves both the efficiency and accuracy of next-generation sequencing analysis tools, thereby allowing biomedical researchers to take full advantage of next-generation sequencing to study biology and disease.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
1R01HG006870-01
Application #
8350385
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Bonazzi, Vivien
Project Start
2012-08-01
Project End
2015-06-30
Budget Start
2012-08-01
Budget End
2013-06-30
Support Year
1
Fiscal Year
2012
Total Cost
$220,000
Indirect Cost
$66,923
Name
University of California Irvine
Department
Biostatistics & Other Math Sci
Type
Other Domestic Higher Education
DUNS #
046705849
City
Irvine
State
CA
Country
United States
Zip Code
92697
Forouzmand, Elmira; Owens, Nick D L; Blitz, Ira L et al. (2017) Developmentally regulated long non-coding RNAs in Xenopus tropicalis. Dev Biol 426:401-408
Quang, Daniel; Xie, Xiaohui (2016) DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res 44:e107
Weng, Lingjie; Li, Yi; Xie, Xiaohui et al. (2016) Poly(A) code analyses reveal key determinants for tissue-specific mRNA alternative polyadenylation. RNA 22:813-21
Quang, Daniel; Chen, Yifei; Xie, Xiaohui (2015) DANN: a deep learning approach for annotating the pathogenicity of genetic variants. Bioinformatics 31:761-3
Li, Yi; Xie, Xiaohui (2015) MixClone: a mixture model for inferring tumor subclonal populations. BMC Genomics 16 Suppl 2:S1
Quang, Daniel X; Erdos, Michael R; Parker, Stephen C J et al. (2015) Motif signatures in stretch enhancers are enriched for disease-associated genetic variants. Epigenetics Chromatin 8:23
Lackford, Brad; Yao, Chengguo; Charles, Georgette M et al. (2014) Fip1 regulates mRNA alternative polyadenylation to promote stem cell self-renewal. EMBO J 33:878-89
Quang, Daniel; Xie, Xiaohui (2014) EXTREME: an online EM algorithm for motif discovery. Bioinformatics 30:1667-73
Chiu, William T; Charney Le, Rebekah; Blitz, Ira L et al. (2014) Genome-wide view of TGFβ/Foxh1 regulation of the early mesendoderm program. Development 141:4537-47
Li, Yi; Xie, Xiaohui (2014) Deconvolving tumor purity and ploidy by integrating copy number alterations and loss of heterozygosity. Bioinformatics 30:2121-9

Showing the most recent 10 out of 15 publications