CSR: Medium: Approximate Membership Query Data Structures in Computational Biology and Storage

Patro, Robert; Bender, Michael; Ferdman, Michael

Abstract

This project will develop new data structures and software for computational biology and big data storage systems. The data structures created in this project will allow computational biology and big data applications to maintain compact summaries of huge data sets. Because the summaries are small, they can be stored in a computer's fast memory, enabling applications to run much more quickly and to scale to larger data sets. For example, this project will develop a tool for searching through genetic information for thousands (to millions) of individuals to detect genetic variations that are correlated with disease or other traits.

A major challenge that this project will address is that applications need compact, feature-rich summary data structures. Applications need summaries that can represent a set of elements, count duplicates in a set of input data, be resized as the data set grows, support deletions of items, be merged with other summaries, and support high concurrency on today's multi-core systems. However, current summary data structures offer limited features. As a result, today's applications must design around these limitations, resulting in software that is slower, uses more memory, and is more complex than necessary.
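To make the feature list above concrete, here is a minimal sketch of a summary that represents a set, counts duplicates, supports deletions, and can be merged. It uses a textbook counting Bloom filter, swapped in purely for illustration; it is not the data structure this project proposes, and the class name, method names, and parameters are all hypothetical. The sketch deliberately omits resizing and concurrency, which are among the features that existing summaries tend to lack.

```python
import hashlib


class CountingBloomFilter:
    """Textbook counting Bloom filter (illustrative only, not the project's design)."""

    def __init__(self, num_counters=1024, num_hashes=3):
        # Both parameters are assumptions chosen for the example.
        self.num_counters = num_counters
        self.num_hashes = num_hashes
        self.counters = [0] * num_counters

    def _positions(self, item):
        # Derive one counter position per hash from salted BLAKE2b digests.
        positions = []
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode("utf-8"),
                digest_size=8,
                salt=i.to_bytes(8, "little"),
            ).digest()
            positions.append(int.from_bytes(digest, "little") % self.num_counters)
        return positions

    def insert(self, item):
        for p in self._positions(item):
            self.counters[p] += 1

    def count(self, item):
        # Approximate count: may overestimate because of hash collisions,
        # but never underestimates as long as only inserted items are deleted.
        return min(self.counters[p] for p in self._positions(item))

    def contains(self, item):
        # Approximate membership: false positives possible, no false negatives.
        return self.count(item) > 0

    def delete(self, item):
        # Refuse to delete an item the filter says is definitely absent.
        if not self.contains(item):
            raise KeyError(item)
        for p in self._positions(item):
            self.counters[p] -= 1

    def merge(self, other):
        # Two filters with identical parameters merge by adding counters.
        if (self.num_counters, self.num_hashes) != (other.num_counters, other.num_hashes):
            raise ValueError("filters must share parameters to be merged")
        merged = CountingBloomFilter(self.num_counters, self.num_hashes)
        merged.counters = [a + b for a, b in zip(self.counters, other.counters)]
        return merged


# Usage: summarize two small batches of DNA k-mers, then merge the summaries.
batch_a = CountingBloomFilter()
for kmer in ["ACGT", "ACGT", "CGTA"]:
    batch_a.insert(kmer)

batch_b = CountingBloomFilter()
batch_b.insert("ACGT")

combined = batch_a.merge(batch_b)
print(combined.contains("ACGT"), combined.count("ACGT"))
```

The counter array has a fixed size, so this sketch cannot grow with the data set without being rebuilt from the original items. That limitation is one example of the design constraints that applications currently have to work around.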

The project will impact core computer science applications, such as databases and file systems, and medical and biological applications, such as genome and transcriptome analysis. Databases and file systems will run faster and use less memory. They will be able to combine fast, expensive solid-state storage devices with cheap, slow, but capacious hard drives to get the best of both devices: low cost and high performance. Biologists will be able to analyze sequencing data more quickly and cheaply, using fewer computational resources. They will be able to search through huge datasets to make new discoveries.

All papers, documentation, and software created by this project will be released as open source, typically on popular open-source development websites, such as GitHub, under the COMBINE-lab (https://github.com/COMBINE-lab) and splatlab (https://github.com/splatlab) organizations. Papers will be hosted by the publishers, as well as on the authors' personal websites.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Application #: 1763680
Program Officer: Erik Brunvand

Budget Start: 2018-08-15
Budget End: 2022-07-31
Fiscal Year: 2017
Total Cost: $889,876

Institution

State University New York Stony Brook, Stony Brook, NY, United States
