DNA sequencing has become an indispensable tool in many areas of biology and medicine. Recent techno- logical breakthroughs in next-generation sequencing (NGS) have made it possible to sequence billions of bases quickly and cheaply. A number of NGS-based tools have been created, including ChIP-seq, RNA-seq, Methyl- seq and exon/whole-genome sequencing, enabling a fundamentally new way of studying diseases, genomes and epigenomes. The widespread use of NGS-based methods calls for better and more efficient tools for the analysis and interpretation of the NGS high-throughput data. Although a number of computational tools have been devel- oped, they are insufficient in mapping and studying genome features located within repeat, duplicated and other so-called unmappable regions of genomes. In this project, computational algorithms and software that expand genomic accessibility of NGS to these previously understudied regions will be developed. The algorithms will begin with a new way of mapping raw reads from NGS to the reference genome, followed by a machine learning method to resolve ambiguously mapped reads, and will be integrated into a comprehen- sive analysis pipeline for ChIP-seq. More specifically, the three aims of the research are to develop: (1) Data structures and efficient algorithms for read mapping to rapidly identify all mapping locations. Unlike existing methods, the focus of this research is to rapidly identify all candidate locations of each read, instead of one or only a few locations. (2) Machine learning algorithms for read analysis to resolve ambiguously mapped reads for both ChIP-seq analysis and genetic variation detection. This work will develop probabilistic models to resolve ambiguously mapped reads by pooling information from the entire collection of reads. (3) A comprehensive ChIP- seq analysis pipeline to systematically study genomic features located within unmappable regions of genomes. These algorithms will be tested and refined using both publicly available data and data from established wet-lab collaborators. In addition to discovering new genomic features located within repeat, duplicated or other previously unac- cessible regions, this work will provide the NGS community with (a) a faster and more accurate tool for mapping short sequence reads, (b) a general methodology for expanding genomic accessibility of NGS, and (c) a versatile, modular, open-source toolbox of algorithms for NGS data analysis, (d) a comprehensive analysis of protein-DNA interactions in repeat regions in all publicly available ChIP-seq datasets. This work is a close collaboration between computer scientists and web-lab biologists who are developing NGS assays to study biomedical problems. In particular, we will collaborate with Timothy Osborne of Sanford- Burnham Medical Research Institute to study regulators involved in cholesterol and fatty acid metabolism, with Kyoko Yokomori of UC Irvine to study Cohesin, Nipbl and their roles in Cornelia de Lange syndrome, and Ken Cho of UC Irvine to study the roles of FoxH1 and Schnurri in development and growth control.

Public Health Relevance

DNA-sequencing has become an indispensable tool for basic biomedical research as well as for discovering new treatments and helping biomedical researchers understand disease mechanisms. Next-generation sequencing, which enables rapid generation of billions of bases at relatively low cost, poses a significant computational challenge on how to analyze the large amount of sequence data efficiently and accurately. The goal of this research is to develop open-source software to improve both the efficiency and accuracy of the next-generation sequencing analysis tools, and thereby allowing biomedical researchers to take full advantage of next-generation sequencing to study biology and disease.

National Institute of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Project #
Application #
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Bonazzi, Vivien
Project Start
Project End
Budget Start
Budget End
Support Year
Fiscal Year
Total Cost
Indirect Cost
University of California Irvine
Biostatistics & Other Math Sci
Other Domestic Higher Education
United States
Zip Code
Quang, Daniel; Chen, Yifei; Xie, Xiaohui (2015) DANN: a deep learning approach for annotating the pathogenicity of genetic variants. Bioinformatics 31:761-3
Kim, Jongik; Li, Chen; Xie, Xiaohui (2014) Improving read mapping using additional prefix grams. BMC Bioinformatics 15:42
Li, Yi; Xie, Xiaohui (2014) Deconvolving tumor purity and ploidy by integrating copy number alterations and loss of heterozygosity. Bioinformatics 30:2121-9
Quang, Daniel; Xie, Xiaohui (2014) EXTREME: an online EM algorithm for motif discovery. Bioinformatics 30:1667-73
Lackford, Brad; Yao, Chengguo; Charles, Georgette M et al. (2014) Fip1 regulates mRNA alternative polyadenylation to promote stem cell self-renewal. EMBO J 33:878-89
Li, Yi; Xie, Xiaohui (2013) A mixture model for expression deconvolution from RNA-seq in heterogeneous tissues. BMC Bioinformatics 14 Suppl 5:S11