The recent emergence of a variety of high-throughput DNA sequencing instrumentation, and the concomitant rapid decline in the cost per base, is causing severe data deluge in all areas of life sciences. The heterogeneity of sequencing instrumentation and the vast diversity of applications enabled by them are creating numerous analytics problems for the bioinformatics community to address. In addition, the conventional serial algorithms that have been the mainstay of bioinformatics research are severely challenged by the ever increasing data sets. The goal of the proposed project is to develop core techniques and software libraries to enable scalable, efficient, high performance computing solutions for high-throughput DNA sequencing, also known as next-generation sequencing (NGS). To empower the larger community, the project seeks to 1) identify a set of core functionalities that frequently occur in many types of high-throughput sequencing applications, 2) develop efficient parallel algorithms and high performance implementations for them, 3) pursue mapping to HPC architectures including clusters, multicores, and GPUs, 4) develop software libraries encapsulating these functionalities with the goal of enabling the bioinformatics community to exploit HPC architectures, and 5) design a domain specific language to enable bioinformatics researchers unfamiliar with parallel processing to benefit from this work through automatic generation of parallel codes. The research will be conducted in the context of challenging problems in human genetics and metagenomics, in collaboration with domain specialists.
This project is focused on a key capacity building activity to facilitate pervasive use of parallelism by NGS bioinformatics researchers and practitioners. The goal is to empower the broader community to benefit from clever parallel algorithms, highly tuned implementations, and specialized HPC hardware, without requiring expertise in any of these. The software libraries will be released as open source for use, further development, enhancements, and incorporation by the community. The project will provide opportunities for training postdoctoral and graduate students in bigdata analytics and computer science driven interdisciplinary research. Diverse existing mechanisms at the partner institutions will be leveraged to advance goals of minority and women recruitment, undergraduate participation in research, and K-12 outreach.