This project will develop new data structures and software for computational biology and big data storage systems. The data structures created in this project will allow computational biology and big data applications to maintain compact summaries of huge data sets. Because the summaries are small, they can be stored in a computer's fast memory, enabling applications to run much more quickly and to scale to larger data sets. For example, this project will develop a tool for searching through the genetic information of thousands to millions of individuals to detect genetic variations that are correlated with disease or other traits.

A major challenge that this project will address is that applications need compact, feature-rich summary data structures. Applications need summaries that can represent a set of elements, count duplicates in a set of input data, be resized as the data set grows, support deletions of items, be merged with other summaries, and support high concurrency on today's multi-core systems. However, current summary data structures offer limited features. As a result, today's applications must design around these limitations, resulting in software that is slower, uses more memory, and is more complex than necessary.
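To make the feature list above concrete, here is a minimal sketch of a summary that represents a set, counts duplicates, supports deletions, and can be merged. It uses a counting Bloom filter, a classic structure swapped in purely for illustration; it is not the data structure this project develops. The class name `CountingSummary`, its parameters, and the sample strings are all hypothetical. Resizing and concurrency are left out to keep the example short.

```python
import hashlib


class CountingSummary:
    """Toy counting Bloom filter (illustrative only, not the project's design)."""

    def __init__(self, num_counters=1024, num_hashes=3):
        self.num_counters = num_counters
        self.num_hashes = num_hashes
        self.counters = [0] * num_counters

    def _slots(self, item):
        # Derive num_hashes counter positions from salted SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_counters

    def add(self, item):
        for slot in self._slots(item):
            self.counters[slot] += 1

    def count(self, item):
        # Minimum over the item's counters: never an undercount,
        # but collisions with other items can cause an overcount.
        return min(self.counters[slot] for slot in self._slots(item))

    def remove(self, item):
        # Only meaningful for items that were actually added.
        if self.count(item) == 0:
            raise KeyError(item)
        for slot in self._slots(item):
            self.counters[slot] -= 1

    def merge(self, other):
        # Summaries built with identical parameters merge by adding counters.
        if (self.num_counters, self.num_hashes) != (other.num_counters, other.num_hashes):
            raise ValueError("summaries must share the same parameters")
        merged = CountingSummary(self.num_counters, self.num_hashes)
        merged.counters = [a + b for a, b in zip(self.counters, other.counters)]
        return merged


# Usage: two summaries built independently, then merged.
left, right = CountingSummary(), CountingSummary()
left.add("ACGT")
left.add("ACGT")
right.add("ACGT")
combined = left.merge(right)
print(combined.count("ACGT"))  # an estimate that is at least 3
```

The memory used is fixed by `num_counters` regardless of how many items are inserted, which is what makes such a summary small enough to keep in fast memory; the price is that counts are estimates that can err on the high side.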

The project will impact core computer science applications, such as databases and file systems, and medical and biological applications, such as genome and transcriptome analysis. Databases and file systems will run faster and use less memory. They will be able to combine fast, expensive solid-state storage devices with cheap, slow, but capacious hard drives to get the best of both devices: low cost and high performance. Biologists will be able to analyze sequencing data more quickly and cheaply, using fewer computational resources. They will be able to search through huge datasets to make new discoveries.

All papers, documentation, and software created by this project will be released as open source, typically on popular open-source development websites, such as GitHub, under the COMBINE-lab (https://github.com/COMBINE-lab) and splatlab (https://github.com/splatlab) organizations. Papers will be hosted by the publishers, as well as on the authors' personal websites.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Application #: 1763680
Program Officer: Erik Brunvand
Project Start:
Project End:
Budget Start: 2018-08-15
Budget End: 2022-07-31
Support Year:
Fiscal Year: 2017
Total Cost: $889,876
Indirect Cost:
Name: State University New York Stony Brook
Department:
Type:
DUNS #:
City: Stony Brook
State: NY
Country: United States
Zip Code: 11794