The recent emergence of a variety of high-throughput DNA sequencing instrumentation, and the concomitant rapid decline in the cost per base, is causing a severe data deluge in all areas of the life sciences. The heterogeneity of sequencing instrumentation and the vast diversity of applications it enables are creating numerous analytics problems for the bioinformatics community to address. In addition, the conventional serial algorithms that have been the mainstay of bioinformatics research are severely challenged by ever-increasing data sets. The goal of the proposed project is to develop core techniques and software libraries to enable scalable, efficient, high-performance computing (HPC) solutions for high-throughput DNA sequencing, also known as next-generation sequencing (NGS). To empower the larger community, the project seeks to 1) identify a set of core functionalities that occur frequently across many types of high-throughput sequencing applications, 2) develop efficient parallel algorithms and high-performance implementations for them, 3) pursue mapping to HPC architectures including clusters, multicores, and GPUs, 4) develop software libraries encapsulating these functionalities with the goal of enabling the bioinformatics community to exploit HPC architectures, and 5) design a domain-specific language to enable bioinformatics researchers unfamiliar with parallel processing to benefit from this work through automatic generation of parallel codes. The research will be conducted in the context of challenging problems in human genetics and metagenomics, in collaboration with domain specialists.

This project is focused on a key capacity-building activity to facilitate pervasive use of parallelism by NGS bioinformatics researchers and practitioners. The goal is to empower the broader community to benefit from clever parallel algorithms, highly tuned implementations, and specialized HPC hardware, without requiring expertise in any of these. The software libraries will be released as open source for use, further development, enhancement, and incorporation by the community. The project will provide opportunities for training postdoctoral researchers and graduate students in big data analytics and computer-science-driven interdisciplinary research. Diverse existing mechanisms at the partner institutions will be leveraged to advance the goals of minority and women recruitment, undergraduate participation in research, and K-12 outreach.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1416259
Program Officer: Sylvia Spengler
Budget Start: 2013-08-31
Budget End: 2018-12-31
Fiscal Year: 2014
Total Cost: $1,285,507
Name: Georgia Tech Research Corporation
City: Atlanta
State: GA
Country: United States
Zip Code: 30332