Machine learning methods to increase genomic accessibility by next-gen sequencing

Xie, Xiaohui

Abstract

DNA sequencing has become an indispensable tool in many areas of biology and medicine. Recent technological breakthroughs in next-generation sequencing (NGS) have made it possible to sequence billions of bases quickly and cheaply. A number of NGS-based tools have been created, including ChIP-seq, RNA-seq, Methyl-seq and exon/whole-genome sequencing, enabling a fundamentally new way of studying diseases, genomes and epigenomes. The widespread use of NGS-based methods calls for better and more efficient tools for the analysis and interpretation of the NGS high-throughput data. Although a number of computational tools have been developed, they are insufficient in mapping and studying genome features located within repeat, duplicated and other so-called unmappable regions of genomes. In this project, computational algorithms and software that expand genomic accessibility of NGS to these previously understudied regions will be developed. The algorithms will begin with a new way of mapping raw reads from NGS to the reference genome, followed by a machine learning method to resolve ambiguously mapped reads, and will be integrated into a comprehensive analysis pipeline for ChIP-seq. More specifically, the three aims of the research are to develop: (1) Data structures and efficient algorithms for read mapping to rapidly identify all mapping locations. Unlike existing methods, the focus of this research is to rapidly identify all candidate locations of each read, instead of one or only a few locations. (2) Machine learning algorithms for read analysis to resolve ambiguously mapped reads for both ChIP-seq analysis and genetic variation detection. This work will develop probabilistic models to resolve ambiguously mapped reads by pooling information from the entire collection of reads. (3) A comprehensive ChIP-seq analysis pipeline to systematically study genomic features located within unmappable regions of genomes.
These algorithms will be tested and refined using both publicly available data and data from established wet-lab collaborators. In addition to discovering new genomic features located within repeat, duplicated or other previously inaccessible regions, this work will provide the NGS community with (a) a faster and more accurate tool for mapping short sequence reads, (b) a general methodology for expanding genomic accessibility of NGS, (c) a versatile, modular, open-source toolbox of algorithms for NGS data analysis, and (d) a comprehensive analysis of protein-DNA interactions in repeat regions in all publicly available ChIP-seq datasets. This work is a close collaboration between computer scientists and wet-lab biologists who are developing NGS assays to study biomedical problems. In particular, we will collaborate with Timothy Osborne of Sanford-Burnham Medical Research Institute to study regulators involved in cholesterol and fatty acid metabolism, with Kyoko Yokomori of UC Irvine to study Cohesin, Nipbl and their roles in Cornelia de Lange syndrome, and with Ken Cho of UC Irvine to study the roles of FoxH1 and Schnurri in development and growth control.
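
As a rough illustration of aims (1) and (2), the Python sketch below first lists every exact-match location of a read using a k-mer hash index, then fractionally assigns ambiguously mapped reads with a simple expectation-maximization (EM) loop that pools evidence across all reads. This is not the project's actual method. The function names (build_kmer_index, all_exact_locations, em_allocate), the exact-match restriction, the fixed-size genomic bins and the per-bin abundance model are all assumptions made for the example.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer in the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def all_exact_locations(read, genome, index, k):
    """Return every position where the read matches the genome exactly.

    Assumes len(read) >= k. A real mapper would also tolerate mismatches
    and indels; exact matching keeps the sketch short.
    """
    candidates = index.get(read[:k], [])
    return [p for p in candidates if genome[p:p + len(read)] == read]


def em_allocate(read_locations, bin_size, n_iter=50):
    """Fractionally assign each read to its candidate locations with EM.

    read_locations: one non-empty list of candidate positions per read.
    Assumed model: each genomic bin has an abundance theta, and a read
    originates from a candidate location with probability proportional
    to the abundance of that location's bin.
    Returns (responsibilities per read, final theta per bin).
    """
    bins = sorted({p // bin_size for locs in read_locations for p in locs})
    theta = {b: 1.0 / len(bins) for b in bins}
    n_reads = len(read_locations)
    resp = []
    for _ in range(n_iter):
        counts = dict.fromkeys(bins, 0.0)
        resp = []
        for locs in read_locations:
            # E-step: weight each candidate by its bin's current abundance.
            weights = [theta[p // bin_size] for p in locs]
            total = sum(weights)
            r = [w / total for w in weights]
            resp.append(r)
            for p, ri in zip(locs, r):
                counts[p // bin_size] += ri
        # M-step: abundance is the expected fraction of reads per bin.
        theta = {b: c / n_reads for b, c in counts.items()}
    return resp, theta


if __name__ == "__main__":
    genome = "ACGTTTACGTAA"
    index = build_kmer_index(genome, 4)
    print(all_exact_locations("ACGT", genome, index, 4))
    # Three uniquely mapped reads near position 0 and one ambiguous read.
    resp, theta = em_allocate([[0], [0], [0], [0, 100]], bin_size=10)
    print(resp[-1], theta)
```

In this toy input, the three uniquely mapped reads raise the estimated abundance of the first bin, so the ambiguous read is assigned almost entirely to position 0 instead of being discarded or split evenly.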

Public Health Relevance

DNA sequencing has become an indispensable tool for basic biomedical research as well as for discovering new treatments and helping biomedical researchers understand disease mechanisms. Next-generation sequencing, which enables rapid generation of billions of bases at relatively low cost, poses a significant computational challenge: how to analyze the large amount of sequence data efficiently and accurately. The goal of this research is to develop open-source software that improves both the efficiency and the accuracy of next-generation sequencing analysis tools, thereby allowing biomedical researchers to take full advantage of next-generation sequencing to study biology and disease.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG006870-02
Application #: 8518436
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Bonazzi, Vivien

Project Start: 2012-08-01
Project End: 2015-06-30
Budget Start: 2013-07-01
Budget End: 2014-06-30
Support Year: 2
Fiscal Year: 2013
Total Cost: $220,626
Indirect Cost: $67,548

Institution

Name: University of California Irvine
Department: Biostatistics & Other Math Sci
Type: Other Domestic Higher Education
DUNS #: 046705849

City: Irvine
State: CA
Country: United States
Zip Code: 92697

Related projects


NIH 2014 R01 HG	Machine learning methods to increase genomic accessibility by next-gen sequencing Xie, Xiaohui / University of California Irvine
NIH 2013 R01 HG	Machine learning methods to increase genomic accessibility by next-gen sequencing Xie, Xiaohui / University of California Irvine	$220,626
NIH 2012 R01 HG	Machine learning methods to increase genomic accessibility by next-gen sequencing Xie, Xiaohui / University of California Irvine	$220,000

Publications

Forouzmand, Elmira; Owens, Nick D L; Blitz, Ira L et al. (2017) Developmentally regulated long non-coding RNAs in Xenopus tropicalis. Dev Biol 426:401-408

Quang, Daniel; Xie, Xiaohui (2016) DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res 44:e107

Weng, Lingjie; Li, Yi; Xie, Xiaohui et al. (2016) Poly(A) code analyses reveal key determinants for tissue-specific mRNA alternative polyadenylation. RNA 22:813-21

Quang, Daniel; Chen, Yifei; Xie, Xiaohui (2015) DANN: a deep learning approach for annotating the pathogenicity of genetic variants. Bioinformatics 31:761-3

Li, Yi; Xie, Xiaohui (2015) MixClone: a mixture model for inferring tumor subclonal populations. BMC Genomics 16 Suppl 2:S1

Quang, Daniel X; Erdos, Michael R; Parker, Stephen C J et al. (2015) Motif signatures in stretch enhancers are enriched for disease-associated genetic variants. Epigenetics Chromatin 8:23

Li, Yi; Xie, Xiaohui (2014) Deconvolving tumor purity and ploidy by integrating copy number alterations and loss of heterozygosity. Bioinformatics 30:2121-9

Watanabe, Kazuhide; Biesinger, Jacob; Salmans, Michael L et al. (2014) Integrative ChIP-seq/microarray analysis identifies a CTNNB1 target signature enriched in intestinal stem cells and colon cancer. PLoS One 9:e92317

Kim, Jongik; Li, Chen; Xie, Xiaohui (2014) Improving read mapping using additional prefix grams. BMC Bioinformatics 15:42

Lackford, Brad; Yao, Chengguo; Charles, Georgette M et al. (2014) Fip1 regulates mRNA alternative polyadenylation to promote stem cell self-renewal. EMBO J 33:878-89

Showing the most recent 10 out of 15 publications
