Genomic Compression: From Information Theory to Parallel Algorithms

Milenkovic, Olgica; Weissman, Tsachy

Abstract

One of the highest priorities of modern healthcare research and practice is to identify genomic changes and markers that predispose individuals to debilitating diseases or make them more responsive to certain therapies and emerging treatments. Timely discovery and knowledge mining in this area of medical research is largely enabled by massive DNA sequencing and functional genomic data, the volumes of which are expected to experience drastic growth in the near future. It is therefore of paramount importance to develop efficient, accurate, and low-latency data compression and decompression techniques that will allow for fast exchange, dissemination, random access, visualization and search of diversely formatted genomic information. The use of specialized compression methods for biological data will ensure unprecedented growth of NIH databases and their utility, new uses of crowd-sourced computing in medical research, and large scale dissemination of experimental results.
Specific aims of the proposal include developing parallel, task-oriented algorithms for a) reference-based and reference-free compression of reads and whole genomes; b) lossy compression of quality scores; and c) compression of functional genomic data. Although the three data categories have different statistical properties and formats, they may be compressed using similar combinations of pre-processing, statistical coding, and parallel algorithms. Furthermore, some of the universal features of the developed compression techniques will make it possible to successfully apply them to other emerging genomic data formats. The long-term objectives of the proposed research program are two-fold. The first objective is to perform fundamental analytical studies of lossless and certain restricted forms of lossy compression and dimensionality reduction methods for genomic and functional genomic data, using information-theoretic techniques. The second objective is to develop a new suite of parallel algorithms for SAM, FASTQ and Wig track data compression. The developed algorithms are expected to include suitably combined, modified and extended classical compression methods (e.g., arithmetic, Huffman, and Lempel-Ziv coding), as well as novel solutions based on context-mixing and context-tree weighting with biological side-information. Immediate goals of the project include using CUDA, as well as classical parallel computing platforms, to implement current compression algorithms in order to reduce the latency of the compression and decompression process. Novel components of the parallel implementations will include extensive use of state-of-the-art hashing, indexing, and stringing methods. SAM, FASTQ and Wig data files are ubiquitous in genomic research. Hence, a research program resulting in high-performance software suites for compression of these and other genomic information formats will enable management, transfer and access to massive data crucial for the operation of governmental and NIH sponsored projects such as ENCODE, TCGA, ClinVar, Genome 10K, the Million Cancer Genome Warehouse, and ADAM.

Public Health Relevance

Genomic and functional genomic data is essential for biomedical research, but costly to store, access and process in the Big Data era. Fast and efficient data compression is an alternative to the undesired practice of aggressively archiving and discarding data that may prove important for future research efforts. Although a number of methods for genomic data compression have been put forward, most are not sufficiently specialized to the format, statistics and volumes of genomic data, and may be significantly improved upon using novel information-theoretic approaches and parallel computing platforms.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Cancer Institute (NCI)
Type: Research Project--Cooperative Agreements (U01)
Project #: 5U01CA198943-03
Application #: 9259954
Study Section: Special Emphasis Panel (ZRG1-BST-N (50)R)
Program Officer: Li, Jerry

Project Start: 2015-06-01
Project End: 2018-05-31
Budget Start: 2017-06-01
Budget End: 2018-05-31
Support Year: 3
Fiscal Year: 2017
Total Cost: $303,536
Indirect Cost: $75,804

Institution

Name: University of Illinois Urbana-Champaign
Department: Engineering (All Types)
Type: Schools of Engineering
DUNS #: 041544081

City: Champaign
State: IL
Country: United States
Zip Code: 61820

Related projects


NIH 2017 U01 CA	Genomic Compression: From Information Theory to Parallel Algorithms Milenkovic, Olgica; Weissman, Tsachy / University of Illinois Urbana-Champaign	$303,536
NIH 2016 U01 CA	Genomic Compression: From Information Theory to Parallel Algorithms Milenkovic, Olgica; Weissman, Tsachy / University of Illinois Urbana-Champaign
NIH 2016 U01 CA	Genomic Compression: From Information Theory to Parallel Algorithms Milenkovic, Olgica; Weissman, Tsachy / University of Illinois Urbana-Champaign	$167,294
NIH 2015 U01 CA	Genomic Compression: From Information Theory to Parallel Algorithms Milenkovic, Olgica; Weissman, Tsachy / University of Illinois Urbana-Champaign

Publications

Chandak, Shubham; Tatwawadi, Kedar; Weissman, Tsachy (2018) Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis. Bioinformatics 34:558-567

Lee, Byunghan; Moon, Taesup; Yoon, Sungroh et al. (2017) DUDE-Seq: Fast, flexible, and robust denoising for targeted amplicon sequencing. PLoS One 12:e0181463

Long, Reggy; Hernaez, Mikel; Ochoa, Idoia et al. (2017) GeneComp, a new reference-based compressor for SAM files. Proc Data Compress Conf 2017:330-339

Dau, Hoang; Milenkovic, Olgica (2017) Latent Network Features and Overlapping Community Discovery via Boolean Intersection Representations. IEEE ACM Trans Netw 25:3219-3234

Pavlichin, Dmitri S; Ingber, Amir; Weissman, Tsachy (2017) Compressing Tabular Data via Pairwise Dependencies. Proc Data Compress Conf 2017:455

Ochoa, Idoia; Hernaez, Mikel; Goldfeder, Rachel et al. (2017) Effect of lossy compression of quality scores on variant calling. Brief Bioinform 18:183-194

No, Albert; Weissman, Tsachy (2016) Rateless Lossy Compression via the Extremes. IEEE Trans Inf Theory 62:5484-5495

Wang, Zhiying; Weissman, Tsachy; Milenkovic, Olgica (2016) smallWig: parallel compression of RNA-seq WIG files. Bioinformatics 32:173-180

Tatwawadi, Kedar; Hernaez, Mikel; Ochoa, Idoia et al. (2016) GTRAC: fast retrieval from compressed collections of genomic variants. Bioinformatics 32:i479-i486

Kim, Minji; Zhang, Xiejia; Ligo, Jonathan G et al. (2016) MetaCRAM: an integrated pipeline for metagenomic taxonomy identification and compression. BMC Bioinformatics 17:94

Showing the most recent 10 out of 26 publications
