Next-generation sequencing has turned genomics into a data-intensive computing discipline, raising several salient challenges. First, the deluge of genomic data must undergo deep analysis to extract biological information, which requires a full pipeline that integrates many data processing and analysis tools. Second, deep analysis pipelines often take a long time to run, which lengthens the development cycle for new algorithms and methods. This project aims to bring the latest big data and database technology to the genomics domain to greatly increase its data processing power. The project is anticipated to produce significant scientific and educational benefits. By providing a highly optimized parallel processing platform for genomic data analysis and making it accessible in private and public clouds, it will enable many new models and algorithms to be developed for genomics and help the field advance at unprecedented speed, as big data technology did for Internet companies. The project also integrates research and education through curriculum development, tutorials for K-12 teachers and community college faculty, and the engagement of women in research through college outreach and NSF-funded outreach programs.

The proposed research includes the development of (1) a deep pipeline for genomic data analysis, assembled from state-of-the-art methods, (2) automatic parallelization of the workflow using big data technology, (3) a principled approach to optimizing the genomic pipeline, and (4) integration of streaming technology to reduce the latency of important results. The prototype system will be deployed in both private and public cloud environments and fully evaluated using existing long-running pipelines at the New York Genome Center and a variety of real use cases. In doing so, the project will provide new knowledge about how to adapt and advance big data technology for the genomics domain, including new optimization, partitioning, and scheduling techniques. The results of the project are disseminated at the project web site: http://gesall.cs.umass.edu.

Agency: National Science Foundation (NSF)
Institute: Division of Biological Infrastructure (DBI)
Type: Standard Grant (Standard)
Application #: 1356469
Program Officer: Jennifer Weller
Project Start:
Project End:
Budget Start: 2014-09-15
Budget End: 2018-02-28
Support Year:
Fiscal Year: 2013
Total Cost: $345,581
Indirect Cost:

Name: New York Genome Center
Department:
Type:
DUNS #:
City: New York
State: NY
Country: United States
Zip Code: 10013