New-generation DNA sequencing technologies are revolutionizing modern biological research. Scientists can now generate the rough equivalent of an entire human genome (~3 billion base pairs of DNA) in just a few days with a single sequencing instrument. Until recently, such amounts of data could only be generated at large genome centers using hundreds of sequencers. The analysis of these data is complicated by their size: a single run of a sequencing instrument yields terabytes of information, often requiring a significant scale-up of the existing computational infrastructure. This project is developing parallel algorithms for analyzing new-generation sequencing data, with a specific focus on the Map-Reduce paradigm implemented on a highly distributed computing cluster supported by Google and IBM. The project is primarily focused on developing algorithms for sequence alignment and sequence assembly, both critical tasks in the analysis of genomic data, and involves the adaptation of string matching and graph algorithms to the Map-Reduce paradigm.
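
As a rough illustration of how a string-matching task can be phrased in Map-Reduce terms, the sketch below simulates seed-based read mapping in a single Python process. It is not the project's algorithm: the function names, the seed length `K`, and the in-memory `shuffle` step are all assumptions made for this example. On a real cluster, the framework would perform the grouping across machines.

```python
from collections import defaultdict

K = 4  # seed (k-mer) length; an arbitrary choice for this toy example


def map_reference(ref):
    """Map step: emit (k-mer, ('ref', position)) for each k-mer in the reference."""
    for i in range(len(ref) - K + 1):
        yield ref[i:i + K], ("ref", i)


def map_read(read_id, read):
    """Map step: emit (k-mer, ('read', read_id, offset)) for each k-mer in a read."""
    for j in range(len(read) - K + 1):
        yield read[j:j + K], ("read", read_id, j)


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_seed(values):
    """Reduce step: pair every read seed with every reference hit of the same k-mer.

    Yields (read_id, implied start of the read on the reference).
    """
    ref_positions = [v[1] for v in values if v[0] == "ref"]
    read_hits = [(v[1], v[2]) for v in values if v[0] == "read"]
    for read_id, offset in read_hits:
        for pos in ref_positions:
            yield read_id, pos - offset


def candidate_alignments(ref, reads):
    """Count how many seeds support each (read_id, start) candidate."""
    pairs = list(map_reference(ref))
    for read_id, read in reads.items():
        pairs.extend(map_read(read_id, read))
    votes = defaultdict(int)
    for values in shuffle(pairs).values():
        for read_id, start in reduce_seed(values):
            votes[(read_id, start)] += 1
    return dict(votes)


if __name__ == "__main__":
    reference = "ACGTTGCAAGCT"
    print(candidate_alignments(reference, {"r1": "TTGCAA"}))
```

Candidates supported by many seeds would then be passed to a more careful alignment step. This sketch only counts exact seed matches and does not handle sequencing errors or reverse-complement strands.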

This work could lead to parallel genomic analysis software that allows researchers to analyze new-generation sequencing data using web-scale computational resources, removing the need to establish and maintain a local high-performance computing infrastructure. The software developed during this project is being made available under an open-source license to encourage broad use and enable future research. The research is integrated with the teaching and mentoring of graduate and undergraduate students, and the results will be disseminated through journal publications and conference presentations.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 0844494
Program Officer: Vasant G. Honavar
Project Start:
Project End:
Budget Start: 2009-04-01
Budget End: 2013-03-31
Support Year:
Fiscal Year: 2008
Total Cost: $409,919
Indirect Cost:
Name: University of Maryland College Park
Department:
Type:
DUNS #:
City: College Park
State: MD
Country: United States
Zip Code: 20742