One of the highest priorities of modern healthcare research and practice is to identify genomic changes and markers that predispose individuals to debilitating diseases or make them more responsive to certain therapies and emerging treatments. Timely discovery and knowledge mining in this area of medical research are largely enabled by massive DNA sequencing and functional genomic data, the volumes of which are expected to grow drastically in the near future. It is therefore of paramount importance to develop efficient, accurate, and low-latency data compression and decompression techniques that will allow for fast exchange, dissemination, random access, visualization and search of diversely formatted genomic information. The use of specialized compression methods for biological data will ensure unprecedented growth of NIH databases and their utility, new uses of crowd-sourced computing in medical research, and large-scale dissemination of experimental results.
Specific aims of the proposal include developing parallel, task-oriented algorithms for a) reference-based and reference-free compression of reads and whole genomes; b) lossy compression of quality scores; and c) compression of functional genomic data. Although the three data categories have different statistical properties and formats, they may be compressed using similar combinations of pre-processing, statistical coding, and parallel algorithms. Furthermore, some of the universal features of the developed compression techniques will make it possible to successfully apply them to other emerging genomic data formats. The long-term objectives of the proposed research program are two-fold. The first objective is to perform fundamental analytical studies of lossless and certain restricted forms of lossy compression and dimensionality reduction methods for genomic and functional genomic data, using information-theoretic techniques. The second objective is to develop a new suite of parallel algorithms for SAM, FASTQ and Wig track data compression. The developed algorithms are expected to include suitably combined, modified and extended classical compression methods (e.g., arithmetic, Huffman, and Lempel-Ziv coding), as well as novel solutions based on context-mixing and context-tree weighting with biological side-information. Immediate goals of the project include using CUDA, as well as classical parallel computing platforms, to implement current compression algorithms in order to reduce the latency of the compression and decompression process. Novel components of the parallel implementations will include extensive use of state-of-the-art hashing, indexing, and stringing methods. SAM, FASTQ and Wig data files are ubiquitous in genomic research. Hence, a research program resulting in high-performance software suites for compression of these and other genomic information formats will enable management, transfer and access to massive data crucial for the operation of governmental and NIH-sponsored projects such as ENCODE, TCGA, ClinVar, Genome 10K, the Million Cancer Genome Warehouse, and ADAM.

Public Health Relevance

Genomic and functional genomic data are essential for biomedical research, but costly to store, access and process in the Big Data era. Fast and efficient data compression is an alternative to the undesired practice of aggressively archiving and discarding data that may prove important for future research efforts. Although a number of methods for genomic data compression have been put forward, most are not sufficiently specialized to the format, statistics and volumes of genomic data, and may be significantly improved upon using novel information-theoretic approaches and parallel computing platforms.

Agency: National Institutes of Health (NIH)
Institute: National Cancer Institute (NCI)
Type: Research Project--Cooperative Agreements (U01)
Project #: 5U01CA198943-03
Application #: 9259954
Study Section: Special Emphasis Panel (ZRG1-BST-N (50)R)
Program Officer: Li, Jerry
Project Start: 2015-06-01
Project End: 2018-05-31
Budget Start: 2017-06-01
Budget End: 2018-05-31
Support Year: 3
Fiscal Year: 2017
Total Cost: $303,536
Indirect Cost: $75,804

Name: University of Illinois Urbana-Champaign
Department: Engineering (All Types)
Type: Schools of Engineering
DUNS #: 041544081
City: Champaign
State: IL
Country: United States
Zip Code: 61820
Chandak, Shubham; Tatwawadi, Kedar; Weissman, Tsachy (2018) Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis. Bioinformatics 34:558-567
Lee, Byunghan; Moon, Taesup; Yoon, Sungroh et al. (2017) DUDE-Seq: Fast, flexible, and robust denoising for targeted amplicon sequencing. PLoS One 12:e0181463
Long, Reggy; Hernaez, Mikel; Ochoa, Idoia et al. (2017) GeneComp, a new reference-based compressor for SAM files. Proc Data Compress Conf 2017:330-339
Dau, Hoang; Milenkovic, Olgica (2017) Latent Network Features and Overlapping Community Discovery via Boolean Intersection Representations. IEEE ACM Trans Netw 25:3219-3234
Pavlichin, Dmitri S; Ingber, Amir; Weissman, Tsachy (2017) Compressing Tabular Data via Pairwise Dependencies. Proc Data Compress Conf 2017:455
Ochoa, Idoia; Hernaez, Mikel; Goldfeder, Rachel et al. (2017) Effect of lossy compression of quality scores on variant calling. Brief Bioinform 18:183-194
No, Albert; Weissman, Tsachy (2016) Rateless Lossy Compression via the Extremes. IEEE Trans Inf Theory 62:5484-5495
Wang, Zhiying; Weissman, Tsachy; Milenkovic, Olgica (2016) smallWig: parallel compression of RNA-seq WIG files. Bioinformatics 32:173-180
Tatwawadi, Kedar; Hernaez, Mikel; Ochoa, Idoia et al. (2016) GTRAC: fast retrieval from compressed collections of genomic variants. Bioinformatics 32:i479-i486
Kim, Minji; Zhang, Xiejia; Ligo, Jonathan G et al. (2016) MetaCRAM: an integrated pipeline for metagenomic taxonomy identification and compression. BMC Bioinformatics 17:94

Showing the most recent 10 out of 26 publications