Comparative genomics is a powerful tool for discovering functional elements in the human genome. The foundation of cross-species comparative genomics is multiple sequence alignment (MSA). Despite progress over the past decade, MSA remains a difficult and error-prone task. Alignment errors can directly affect downstream analyses and may lead to incorrect biological conclusions. Many biomedical researchers use publicly available, precomputed MSAs in the Ensembl Browser and the UCSC Genome Browser to conduct various comparative genomic analyses. These MSAs contain errors, yet users often do not ask how reliable an alignment is or do not know how to measure its reliability quantitatively. A preliminary study suggests that a considerable number of conserved elements in the current UCSC Genome Browser might be false positives introduced by unreliable MSAs. The impact of problematic alignments on genome annotation may be much greater than previously thought. In this project, novel probabilistic sampling-based scores to measure the reliability of multiple sequence alignments will be developed. Context-dependent substitution models and more realistic models of insertions and deletions will be employed so that the method can be applied at genome-wide scale and can handle deep alignments of a large number of sequences. In addition, the alignment reliability scores will be used to improve genome annotation. Data on functional elements in the human genome from the ENCODE project will be used to refine the model. The method will also be applied to identify additional functional elements that were originally missed because of uncertainty in the alignment. Improvements to other types of genome annotation (e.g., RNA genes, positive selection) will also be explored. These new methods that capture MSA reliability will greatly reduce the false positives in comparative genomics analyses that are introduced by alignment errors.
If successful, the project will improve the general methodology of comparative genomics, and laboratory experiments that rely on MSA-based computational studies will be much more effective. Results from the project will be integrated into the UCSC Genome Browser to benefit other researchers who use MSAs to discover disease-related signatures. The method could have a meaningful impact on ENCODE, TCGA, Genome 10K, and other large-scale comparative genomics projects. This innovative computational biology project could have an important impact on the genomics community and enable advances in biomedical research.

Public Health Relevance

This project will develop novel methods for analyzing multiple sequence alignments in order to enhance our ability to annotate functional elements in the human genome. The proposed methods will help us better understand the human genome and facilitate the discovery of disease-related signatures. This computational biology project will support rapid advances in genomics and biomedical research.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Exploratory/Developmental Grants (R21)
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Good, Peter J
University of Illinois Urbana-Champaign
Engineering (All Types)
Schools of Engineering
United States
Yokoyama, Ken Daigoro; Zhang, Yang; Ma, Jian (2014) Tracing the evolution of lineage-specific transcription factor binding sites in a birth-death framework. PLoS Comput Biol 10:e1003771
Kim, Jaebum; Ma, Jian (2014) PSAR-align: improving multiple sequence alignment using probabilistic sampling. Bioinformatics 30:1010-2
Earl, Dent; Nguyen, Ngan; Hickey, Glenn et al. (2014) Alignathon: a competitive assessment of whole-genome alignment methods. Genome Res 24:2077-89
Burns, Paul D; Li, Yang; Ma, Jian et al. (2014) UnSplicer: mapping spliced RNA-Seq reads in compact genomes and filtering noisy splicing. Nucleic Acids Res 42:e25
Li, Yang; Li-Byarlay, Hongmei; Burns, Paul et al. (2013) TrueSight: a new algorithm for splice junction detection using RNA-seq. Nucleic Acids Res 41:e51
Li-Byarlay, Hongmei; Li, Yang; Stroud, Hume et al. (2013) RNA interference knockdown of DNA methyl-transferase 3 affects gene alternative splicing in the honey bee. Proc Natl Acad Sci U S A 110:12750-5
Kim, Jaebum; Larkin, Denis M; Cai, Qingle et al. (2013) Reference-assisted chromosome assembly. Proc Natl Acad Sci U S A 110:1785-90
Qiu, Qiang; Zhang, Guojie; Ma, Tao et al. (2012) The yak genome and adaptation to life at high altitude. Nat Genet 44:946-9
Groenen, Martien A M; Archibald, Alan L; Uenishi, Hirohide et al. (2012) Analyses of pig genomes provide insight into porcine demography and evolution. Nature 491:393-8
Wu, Xiao-Long; Heo, Yun; El Hajj, Izzat et al. (2012) TIGER: tiled iterative genome assembler. BMC Bioinformatics 13 Suppl 19:S18