Determining the genomic makeup of individuals is crucial for understanding how certain genomic variants ultimately lead to disease, such as cancer. Determining the genomic makeup of agriculturally important plants, trees, farm animals, and wildlife helps improve agriculture, forestry, veterinary medicine, and environmental science. Since the introduction of "next generation sequencing technologies" in 2008, the cost of genome sequencing has dropped by a factor of 1000. As a result, the rate at which genomic data is generated now far outpaces improvements in our computing and data storage capabilities. With the advent of these cheap and fast genome sequencing technologies, the scientific community has been able to launch mega-projects such as The Pan Cancer Analysis of Whole Genomes Project, which aims to determine the genome sequences of thousands of cancer patients. Our project addresses the imminent data size challenges in these large-scale genomic studies through new genomic data compression methods that reduce the redundancy in how genomic sequences are represented. This redundancy stems from the high similarity among the genome sequences of individual patients, as well as the high similarity between regions within a single human genome. Since the main difficulty in extracting information from genome sequences is computational, reducing the computational resources needed to manage and analyze genomic data through these compression methods will help genomics improve human life and the environment.
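To make the redundancy argument concrete, the following is a minimal sketch, not the project's method. It assumes synthetic random DNA in place of real genomes and uses zlib's preset-dictionary feature as a stand-in for reference-based compression. All function names (`random_genome`, `mutate`, `compress_against`, and so on) are hypothetical and introduced only for illustration. The reference is kept under 32 KB because that is the size of the DEFLATE window.

```python
import random
import zlib


def random_genome(length, seed):
    """Synthetic stand-in for a reference genome (assumption: uniform random bases)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length)).encode("ascii")


def mutate(genome, rate, seed):
    """Derive a second 'individual' by redrawing a small fraction of bases."""
    rng = random.Random(seed)
    out = bytearray(genome)
    for i in range(len(out)):
        if rng.random() < rate:
            out[i] = rng.choice(b"ACGT")  # may redraw the same base
    return bytes(out)


def compress_alone(seq):
    """Compress a sequence with no knowledge of any other genome."""
    return zlib.compress(seq, 9)


def compress_against(seq, reference):
    """Compress a sequence using the reference as a preset dictionary."""
    compressor = zlib.compressobj(level=9, zdict=reference)
    return compressor.compress(seq) + compressor.flush()


def decompress_against(blob, reference):
    """Invert compress_against; the same reference must be supplied."""
    decompressor = zlib.decompressobj(zdict=reference)
    return decompressor.decompress(blob) + decompressor.flush()


reference = random_genome(20_000, seed=1)
individual = mutate(reference, rate=0.001, seed=2)

alone = compress_alone(individual)
shared = compress_against(individual, reference)

print("compressed alone:            ", len(alone), "bytes")
print("compressed against reference:", len(shared), "bytes")
```

In this toy setting, the individual's sequence differs from the reference at only a handful of positions, so encoding it relative to the reference should take far fewer bytes than encoding it in isolation. Real genomes are far larger than a 32 KB window and contain insertions, deletions, and rearrangements, which is part of why specialized methods are needed.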

The impact of this project on student and personnel training will take the form of two new graduate courses at Indiana University: a course on data management, access, and processing for genomic data by PI Sahinalp, and a course on compressed algorithms with a focus on genomic data, emphasizing the effects of new big data paradigms on compression, by PI Ergun. Both courses will fit into the CS PhD program as well as the existing Bioinformatics and Data Science Master's programs; they are also intended to attract the more curious undergraduates.

The rapid advancement of nucleic acid sequencing technology has reshaped almost every field of life science, from agriculture to bioenergy, and from environmental science to biomedicine. Large-scale genome projects are producing petabyte-scale data from thousands of patients or from mobile sensors collecting environmental samples. As the technology marches forward, most people who visit hospitals will eventually have their (possibly tissue-specific) genomes sequenced. Genomic data will be collected from thousands to millions of non-model organisms and their populations in order to assess the biodiversity within the corresponding ecosystems. Complex microbial communities will be sampled from thousands of geographic locations to study the influence of environmental conditions. Furthermore, these studies will involve continuous data collection efforts to monitor dynamic changes in biosystems through genome-wide or transcriptome-wide sequencing. As a result, genomic data generation is expected to occur at an unprecedented pace, necessitating the development of novel algorithms to help reduce the burden of genomic sequence data on computational, storage, and transmission systems.

This project combines the unique strengths of the two investigators at Indiana University, bringing a principled, algorithmic approach to critical infrastructure problems in genomics. The project will address the needs of the next stage of genomic data generation by mega cancer projects, portable devices collecting environmental samples, and even smaller sensors to be embedded in the human body, through the use of new compression tools and compressed data structures for communicating, storing, managing, and accessing large collections of (streaming) genome data. For this purpose, we will employ and expand the existing algorithmic repertoire involving approximation algorithms, sublinear algorithms, lossless data compression, and I/O-efficient, memory-hierarchy-aware/oblivious, and compressed data structures.

Budget Start: 2016-08-01
Budget End: 2020-07-31
Fiscal Year: 2016
Total Cost: $400,000
Name: Indiana University
City: Bloomington
State: IN
Country: United States
Zip Code: 47401