Genome projects and similar efforts are generating vast amounts of data, but many challenges remain in the sequencing and mapping of large genomes. Analysis of these data is also still in its infancy, and refined tools and methodologies are sorely needed. This project aims to provide mathematical and computational improvements in response to several key challenges in the assembly and analysis of genomic data, loosely categorized into mapping and sequencing, regulatory analysis, and discovery of sequence motifs.

Although algorithms for physical mapping and sequencing have been studied for years, there are still no fully satisfactory practical solutions to support current mapping and sequencing efforts as they move to newer technologies, longer and more repeat-rich targets, and dramatically higher throughputs. This project will develop algorithms and software for assembly of physical maps, with particular emphasis on increasing robustness and decreasing the need for human intervention. It will also investigate algorithms and analysis for novel sequencing strategies that have the potential to reduce cost and increase throughput, accuracy, and speed of sequencing and resequencing.

One of the central goals in molecular biology and genetics is to elucidate the function of genes. This goal will become more dominant as more raw genomic information accumulates. By observing the level of expression of a gene in different phases of the organism's life cycle, in different tissues, or at different disease stages, one can get important clues to the gene's role, with significant potential diagnostic and therapeutic applications. This project will provide analytical and algorithmic tools for several problems arising in gene expression analysis. In many cases, one cannot elucidate the functions of individual genes without attacking the more general problem of understanding regulatory pathways--systems of interacting genes and proteins controlling fundamental cellular processes. Again, the technology for gathering relevant data currently far outstrips the capabilities of computational tools for analyzing them. This project will study algorithms for designing and interpreting expression array experiments involving gene knockouts and other perturbations that, in concert with a priori biological knowledge, allow putative pathways to be verified (or refuted) and perhaps even inferred.
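
As a rough illustration of the kind of expression analysis described above, the following sketch flags pairs of genes whose expression profiles rise and fall together across conditions. It is not the project's method: the use of Pearson correlation, the 0.9 threshold, the function names, and the gene names and expression values are all assumptions chosen for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles.

    Assumes neither profile is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def coexpressed_pairs(profiles, threshold=0.9):
    """Return gene pairs whose profiles correlate at or above the threshold."""
    genes = sorted(profiles)
    return [
        (g, h)
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
        if pearson(profiles[g], profiles[h]) >= threshold
    ]

# Hypothetical expression levels measured under four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}

pairs = coexpressed_pairs(profiles)
```

Here geneA and geneB track each other closely, while geneC runs in the opposite direction. Correlation of this kind only suggests candidate sets of coregulated genes; it says nothing about which gene regulates which, which is where perturbation experiments such as knockouts come in.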

Another area where data gathering capabilities outstrip those of our analytical tools is sequence analysis. A recurrent approach in sequence analysis is identification of sequence motifs--approximately repeated patterns in DNA or protein sequence data. Such similarities are often the key to identifying functionally important features such as protein binding sites on DNA sequences. As one example of the potential utility of such tools, they might help bridge an important gap in regulatory studies such as those outlined above. When analysis of expression array data reveals large sets of coregulated genes, a natural next step is to look for common motifs in regions near these genes, potentially binding sites for regulatory proteins. Identification of these proteins would then suggest dependencies in the regulatory pathways for the genes under study. This project will study key problems related to finding motifs and assessing their statistical significance, including subtle sequence signals involving long patterns with inserted and deleted residues.
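
To make the motif-search step concrete, here is a minimal brute-force sketch that looks for short patterns present, up to a few substitutions, in every sequence of a set (for example, regions near coregulated genes). It is an illustration under stated assumptions rather than the project's algorithm: the function names and example sequences are hypothetical, only substitutions are handled (not the inserted and deleted residues mentioned above), and no statistical significance is assessed.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def occurs_approximately(pattern, sequence, max_mismatches):
    """True if some window of the sequence is within max_mismatches of pattern."""
    k = len(pattern)
    return any(
        hamming(pattern, sequence[i:i + k]) <= max_mismatches
        for i in range(len(sequence) - k + 1)
    )

def shared_motifs(sequences, k, max_mismatches):
    """Return every DNA k-mer found, with substitutions allowed, in all sequences.

    Enumerates all 4**k candidates, so it is only practical for small k.
    """
    candidates = ("".join(p) for p in product("ACGT", repeat=k))
    return [
        c for c in candidates
        if all(occurs_approximately(c, s, max_mismatches) for s in sequences)
    ]

# Hypothetical upstream regions; the second carries a one-base variant of ACGT.
regions = ["TTACGTAA", "GGACCTCC"]

exact = shared_motifs(regions, k=4, max_mismatches=0)
approximate = shared_motifs(regions, k=4, max_mismatches=1)
```

With exact matching the two regions share no 4-mer, but allowing one mismatch recovers ACGT among the candidates. Allowing mismatches also admits many spurious patterns, which is why assessing statistical significance matters in practice.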

National Science Foundation (NSF)
Division of Biological Infrastructure (DBI)
Program Officer: Gerald F. Guala
University of Washington
United States