Improve Genome Annotation Using Multiple Sequence Alignment Reliability Scores

Ma, Jian

Abstract

Comparative genomics is a powerful tool for discovering functional elements in the human genome. The foundation of cross-species comparative genomics is multiple sequence alignment (MSA). Despite the progress of the past decade, MSA remains a difficult and error-prone task. Alignment errors can directly affect downstream analyses and may lead to incorrect biological conclusions. Many biomedical researchers use publicly available, precomputed MSAs from the Ensembl Browser and the UCSC Genome Browser to conduct comparative genomic analyses. These MSAs contain errors, yet users often do not ask how reliable an alignment is or do not know how to measure its reliability quantitatively. A preliminary study suggests that a considerable number of conserved elements in the current UCSC Genome Browser might be false positives introduced by unreliable MSAs. The impact of problematic alignments on genome annotation may be much greater than previously thought. In this project, novel probabilistic sampling-based scores to measure multiple sequence alignment reliability will be developed. Context-dependent substitution models and more realistic models of insertions and deletions will be employed so that the method can be applied at genome-wide scale and can handle deep alignments of a large number of sequences. In addition, the alignment reliability scores will be used to improve genome annotation. Data on functional elements in the human genome from the ENCODE project will be used to refine the model. The method will also be applied to recover functional elements that were originally missed because of uncertainty in the alignment. Improvements to other types of genome annotation (e.g. RNA genes, positive selection) will also be explored. These new methods that capture MSA reliability will greatly reduce the false positives that alignment errors introduce into comparative genomics analyses.
If successful, the project will improve the general methodology of comparative genomics and make laboratory experiments that rely on MSA-based computational studies much more effective. Results from the project will be integrated into the UCSC Genome Browser to benefit other researchers who use MSAs in biomedical research to discover disease-related signatures. The method could have a meaningful impact on ENCODE, TCGA, Genome 10K, and other large-scale comparative genomics projects. This innovative computational biology project could have an important impact on the genomics community and enable advances in biomedical research.

Public Health Relevance

This project will develop novel methods for analyzing multiple sequence alignments in order to enhance our ability to annotate functional elements in the human genome. The proposed methods will help us better understand the human genome and facilitate the discovery of disease-related signatures. This computational biology project will support rapid advances in genomics and biomedical research.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Exploratory/Developmental Grants (R21)
Project #: 1R21HG006464-01
Application #: 8229724
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Good, Peter J

Project Start: 2012-02-22
Project End: 2014-01-31
Budget Start: 2012-02-22
Budget End: 2013-01-31
Support Year: 1
Fiscal Year: 2012
Total Cost: $191,336
Indirect Cost: $66,336

Institution

Name: University of Illinois Urbana-Champaign
Department: Engineering (All Types)
Type: Schools of Engineering
DUNS #: 041544081

City: Champaign
State: IL
Country: United States
Zip Code: 61820

Related projects


NIH 2013 R21 HG	Improve Genome Annotation Using Multiple Sequence Alignment Reliability Scores Ma, Jian / University of Illinois Urbana-Champaign	$191,200
NIH 2012 R21 HG	Improve Genome Annotation Using Multiple Sequence Alignment Reliability Scores Ma, Jian / University of Illinois Urbana-Champaign	$191,336

Publications

Yokoyama, Ken Daigoro; Zhang, Yang; Ma, Jian (2014) Tracing the evolution of lineage-specific transcription factor binding sites in a birth-death framework. PLoS Comput Biol 10:e1003771

Kim, Jaebum; Ma, Jian (2014) PSAR-align: improving multiple sequence alignment using probabilistic sampling. Bioinformatics 30:1010-2

Burns, Paul D; Li, Yang; Ma, Jian et al. (2014) UnSplicer: mapping spliced RNA-Seq reads in compact genomes and filtering noisy splicing. Nucleic Acids Res 42:e25

Earl, Dent; Nguyen, Ngan; Hickey, Glenn et al. (2014) Alignathon: a competitive assessment of whole-genome alignment methods. Genome Res 24:2077-89

Li-Byarlay, Hongmei; Li, Yang; Stroud, Hume et al. (2013) RNA interference knockdown of DNA methyl-transferase 3 affects gene alternative splicing in the honey bee. Proc Natl Acad Sci U S A 110:12750-5

Li, Yang; Li-Byarlay, Hongmei; Burns, Paul et al. (2013) TrueSight: a new algorithm for splice junction detection using RNA-seq. Nucleic Acids Res 41:e51

Kim, Jaebum; Larkin, Denis M; Cai, Qingle et al. (2013) Reference-assisted chromosome assembly. Proc Natl Acad Sci U S A 110:1785-90

Wu, Xiao-Long; Heo, Yun; El Hajj, Izzat et al. (2012) TIGER: tiled iterative genome assembler. BMC Bioinformatics 13 Suppl 19:S18

Qiu, Qiang; Zhang, Guojie; Ma, Tao et al. (2012) The yak genome and adaptation to life at high altitude. Nat Genet 44:946-9

Groenen, Martien A M; Archibald, Alan L; Uenishi, Hirohide et al. (2012) Analyses of pig genomes provide insight into porcine demography and evolution. Nature 491:393-8
