Comparative genomics is a powerful tool for discovering functional elements in the human genome. The foundation of cross-species comparative genomics is multiple sequence alignment (MSA). Despite progress over the past decade, MSA remains difficult and error-prone. Alignment errors can directly affect downstream analyses and may lead to incorrect biological conclusions. Many biomedical researchers use the publicly available, precomputed MSAs in the Ensembl Browser and the UCSC Genome Browser to conduct comparative genomic analyses. These MSAs contain errors, yet users often do not ask how reliable an alignment is or do not know how to measure its reliability quantitatively. A preliminary study suggests that a considerable number of conserved elements in the current UCSC Genome Browser might be false positives introduced by unreliable MSAs. The impact of problematic alignments on genome annotation may be much greater than previously thought. In this project, novel probabilistic sampling-based scores to measure the reliability of multiple sequence alignments will be developed. Context-dependent substitution models and more realistic models of insertions and deletions will be employed so that the method can be applied at genome-wide scale and can handle deep alignments of a large number of sequences. In addition, the alignment reliability scores will be used to improve genome annotation. Data on functional elements in the human genome from the ENCODE project will be used to refine the model. The method will also be applied to identify functional elements that were originally missed because of uncertainty in the alignment. Improvements to other types of genome annotation (e.g., RNA genes, positive selection) will also be explored. These new methods that capture MSA reliability will greatly reduce the false positives that alignment errors introduce into comparative genomics analyses.
If successful, the project will improve the general methodology of comparative genomics and make laboratory experiments that rely on MSA-based computational studies much more effective. Results from the project will be integrated into the UCSC Genome Browser to benefit other researchers who use MSAs in biomedical research, including the discovery of disease-related signatures. The method could have a meaningful impact on ENCODE, TCGA, Genome 10K, and other large-scale comparative genomics projects. This innovative computational biology project could have an important impact on the genomics community and enable advances in biomedical research.
This project will develop novel methods for analyzing multiple sequence alignments to enhance our ability to annotate functional elements in the human genome. The proposed methods will help us better understand the human genome and facilitate the discovery of disease-related signatures. This computational biology project will support rapid advances in genomics and biomedical research.
|Yokoyama, Ken Daigoro; Zhang, Yang; Ma, Jian (2014) Tracing the evolution of lineage-specific transcription factor binding sites in a birth-death framework. PLoS Comput Biol 10:e1003771|
|Kim, Jaebum; Ma, Jian (2014) PSAR-align: improving multiple sequence alignment using probabilistic sampling. Bioinformatics 30:1010-2|
|Burns, Paul D; Li, Yang; Ma, Jian et al. (2014) UnSplicer: mapping spliced RNA-Seq reads in compact genomes and filtering noisy splicing. Nucleic Acids Res 42:e25|
|Li-Byarlay, Hongmei; Li, Yang; Stroud, Hume et al. (2013) RNA interference knockdown of DNA methyl-transferase 3 affects gene alternative splicing in the honey bee. Proc Natl Acad Sci U S A 110:12750-5|
|Kim, Jaebum; Larkin, Denis M; Cai, Qingle et al. (2013) Reference-assisted chromosome assembly. Proc Natl Acad Sci U S A 110:1785-90|
|Li, Yang; Li-Byarlay, Hongmei; Burns, Paul et al. (2013) TrueSight: a new algorithm for splice junction detection using RNA-seq. Nucleic Acids Res 41:e51|