Whole-genome sequencing creates numerous opportunities for comparative analysis of different organisms, elucidating the modes of conservation as well as the patterns of divergence that lead to species diversification, robustness, fitness, and taxonomic organization. In particular, selective evolutionary forces create variable rates of conservation at different functional sites, thereby producing distinctive comparative signatures in different genomic regions. These signatures can be exploited by computational methods for improved detection of functionally important regions such as protein-coding exons, RNA genes, promoters, 3' UTRs, and other as yet unexpected features. The exact identification of genes in the human genome remains a challenge: the number of predicted genes was significantly lower than previous estimates indicated, and the predictions themselves disagree substantially, varying with the specific gene-finding methodology deployed. Since the pattern of conservation differs among functional regions of the genome, a comparative computational analysis can, in principle, lead to significantly improved identification of genes in the human genome by using a reference genome such as the mouse genome. However, this comparative methodology depends critically on three factors: 1) the selection of comparative features that provide the most accurate signatures for comparative gene recognition; 2) the selection of a reference genome at the right evolutionary distance from the human genome to provide sufficiently distinctive patterns of conservation in different regions to aid gene recognition; and 3) the selection of the specific gene recognition architecture that is most effective in interpreting the comparative signatures.
In this proposal we develop a general computational framework for comparative analysis of genomic sequences, focusing on achieving a substantial improvement in gene recognition accuracy. We propose a specific architecture for a comparative computational gene recognition system based on evidence integration frameworks. Based on this architecture, we propose to develop a modular and highly portable system for comparative sequence analysis that we plan to use for mouse-human sequence analysis as well as for related genomes soon to be sequenced, including generating an improved annotation of the Drosophila sequence using related genomes.
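As an illustration of what a "comparative signature" can look like, the sketch below measures one widely known example: in protein-coding regions, substitutions between two aligned genomes tend to concentrate at the third codon position. This is not the system described in the proposal; the function names, the assumption of a pre-computed pairwise alignment with a known reading frame, and the toy sequences are all hypothetical.

```python
from typing import List


def mismatch_rate_by_codon_position(human: str, mouse: str, frame: int = 0) -> List[float]:
    """Return the mismatch fraction at codon positions 0, 1 and 2.

    Assumes `human` and `mouse` are rows of a pairwise alignment of equal
    length. Columns containing a gap ('-') are skipped, and the codon
    position is taken from the alignment column index, which is a
    simplification that only holds for mostly gap-free alignments.
    """
    if len(human) != len(mouse):
        raise ValueError("aligned sequences must have equal length")
    mismatches = [0, 0, 0]
    totals = [0, 0, 0]
    for i, (h, m) in enumerate(zip(human.upper(), mouse.upper())):
        if h == "-" or m == "-":
            continue  # ignore gapped columns
        pos = (i - frame) % 3
        totals[pos] += 1
        if h != m:
            mismatches[pos] += 1
    return [mm / t if t else 0.0 for mm, t in zip(mismatches, totals)]


def third_position_excess(human: str, mouse: str, frame: int = 0) -> float:
    """Third-position mismatch rate minus the mean rate at positions 0 and 1.

    Larger positive values are more consistent with a coding-like pattern
    of conservation in the assumed frame.
    """
    rates = mismatch_rate_by_codon_position(human, mouse, frame)
    return rates[2] - (rates[0] + rates[1]) / 2.0


# Toy, made-up aligned fragments (not real genomic data).
toy_human = "ATGGCTAAAGGT"
toy_mouse = "ATGGCCAAGGGC"
rates = mismatch_rate_by_codon_position(toy_human, toy_mouse)
excess = third_position_excess(toy_human, toy_mouse)
```

A score like this would be only one of several features; a gene recognition system of the kind proposed would presumably combine it with other evidence rather than rely on it alone.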

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Exploratory/Developmental Grants Phase II (R33)
Project #
5R33HG002850-03
Application #
7120158
Study Section
Special Emphasis Panel (ZRG1-SSS-Y (11))
Program Officer
Good, Peter J
Project Start
2004-09-24
Project End
2009-08-31
Budget Start
2006-09-01
Budget End
2009-08-31
Support Year
3
Fiscal Year
2006
Total Cost
$341,773
Indirect Cost
Name
Boston University
Department
Engineering (All Types)
Type
Schools of Engineering
DUNS #
049435266
City
Boston
State
MA
Country
United States
Zip Code
02215
Dotan-Cohen, Dikla; Letovsky, Stan; Melkman, Avraham A et al. (2009) Biological process linkage networks. PLoS One 4:e5313
Molla, Michael; Delcher, Arthur; Sunyaev, Shamil et al. (2009) Triplet repeat length bias and variation in the human transcriptome. Proc Natl Acad Sci U S A 106:17095-100
Dotan-Cohen, Dikla; Melkman, Avraham A; Kasif, Simon (2007) Hierarchical tree snipping: clustering guided by prior knowledge. Bioinformatics 23:3335-42
Zhang, Lingang; Kasif, Simon; Cantor, Charles R (2007) Quantifying DNA-protein binding specificities by using oligonucleotide mass tags and mass spectroscopy. Proc Natl Acad Sci U S A 104:3061-6
Alon, Noga; Asodi, Vera; Cantor, Charles et al. (2006) Multi-node graphs: a framework for multiplexed biological assays. J Comput Biol 13:1659-72
Rachlin, John; Cohen, Dikla Dotan; Cantor, Charles et al. (2006) Biological context networks: a mosaic view of the interactome. Mol Syst Biol 2:66
Zheng, Yu; Anton, Brian P; Roberts, Richard J et al. (2005) Phylogenetic detection of conserved gene clusters in microbial genomes. BMC Bioinformatics 6:243
Wu, Chang-Jiun; Kasif, Simon (2005) GEMS: a web server for biclustering analysis of expression data. Nucleic Acids Res 33:W596-9
Rachlin, John; Ding, Chunming; Cantor, Charles et al. (2005) MuPlex: multi-objective multiplex PCR assay design. Nucleic Acids Res 33:W544-7
Lee, Soohyun; Kohane, Isaac; Kasif, Simon (2005) Genes involved in complex adaptive processes tend to have highly conserved upstream regions in mammalian genomes. BMC Genomics 6:168

Showing the most recent 10 out of 11 publications