One of the highest priorities of modern healthcare research and practice is to identify genomic changes and markers that predispose individuals to debilitating diseases or make them more responsive to certain therapies and emerging treatments. Timely discovery and knowledge mining in this area of medical research are largely enabled by massive DNA sequencing and functional genomic data, the volumes of which are expected to grow drastically in the near future. It is therefore of paramount importance to develop efficient, accurate, and low-latency data compression and decompression techniques that will allow for fast exchange, dissemination, random access, visualization and search of diversely formatted genomic information. The use of specialized compression methods for biological data will support continued growth of NIH databases and their utility, new uses of crowd-sourced computing in medical research, and large-scale dissemination of experimental results.
Specific aims of the proposal include developing parallel, task-oriented algorithms for a) reference-based and reference-free compression of reads and whole genomes; b) lossy compression of quality scores; and c) compression of functional genomic data. Although the three data categories have different statistical properties and formats, they may be compressed using similar combinations of pre-processing, statistical coding, and parallel algorithms. Furthermore, some of the universal features of the developed compression techniques will make it possible to apply them to other emerging genomic data formats. The long-term objectives of the proposed research program are two-fold. The first objective is to perform fundamental analytical studies of lossless and certain restricted forms of lossy compression and dimensionality reduction methods for genomic and functional genomic data, using information-theoretic techniques. The second objective is to develop a new suite of parallel algorithms for SAM, FASTQ and Wig track data compression. The developed algorithms are expected to include suitably combined, modified and extended classical compression methods (e.g., arithmetic, Huffman, and Lempel-Ziv coding), as well as novel solutions based on context-mixing and context-tree weighting with biological side-information. Immediate goals of the project include using CUDA, as well as classical parallel computing platforms, to implement current compression algorithms in order to reduce the latency of the compression and decompression processes. Novel components of the parallel implementations will include extensive use of state-of-the-art hashing, indexing, and stringing methods.
SAM, FASTQ and Wig data files are ubiquitous in genomic research. Hence, a research program resulting in high-performance software suites for compression of these and other genomic information formats will enable management, transfer and access to massive data crucial for the operation of governmental and NIH-sponsored projects such as ENCODE, TCGA, ClinVar, Genome 10K, the Million Cancer Genome Warehouse, and ADAM.
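As a rough illustration of aim b), lossy compression of quality scores, the following minimal Python sketch quantizes Phred scores into uniform bins and compares the zeroth-order empirical entropy before and after. It is not the method developed under this project: the uniform quantizer, the bin width of 8, the Phred+33 offset, the function names, and the example quality string are all assumptions made for illustration.

```python
import math
from collections import Counter


def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def quantize(scores, bin_width=8):
    """Replace each score by the midpoint of its bin (uniform quantizer, illustrative only)."""
    return [(q // bin_width) * bin_width + bin_width // 2 for q in scores]


def empirical_entropy(symbols):
    """Zeroth-order empirical entropy in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


# Made-up quality string, not taken from real sequencing data.
quality = "IIIIHHGGFFEEDDCCBBAA@@??>>==<<;;"
original = phred_scores(quality)
lossy = quantize(original)

print(f"distinct values: {len(set(original))} -> {len(set(lossy))}")
print(f"entropy (bits/score): {empirical_entropy(original):.3f} -> {empirical_entropy(lossy):.3f}")
```

Fewer distinct values and a lower empirical entropy mean that a statistical coder (for example, arithmetic or Huffman coding) would need fewer bits per score, at the cost of distorting the original scores. How much distortion downstream analyses such as variant calling tolerate is a separate question that this sketch does not address.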
Genomic and functional genomic data are essential for biomedical research, but costly to store, access and process in the Big Data era. Fast and efficient data compression is an alternative to the undesirable practice of aggressively archiving and discarding data that may prove important for future research efforts. Although a number of methods for genomic data compression have been put forward, most are not sufficiently specialized to the format, statistics and volumes of genomic data, and may be significantly improved upon using novel information-theoretic approaches and parallel computing platforms.
Chandak, Shubham; Tatwawadi, Kedar; Weissman, Tsachy (2018) Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis. Bioinformatics 34:558-567
Lee, Byunghan; Moon, Taesup; Yoon, Sungroh et al. (2017) DUDE-Seq: Fast, flexible, and robust denoising for targeted amplicon sequencing. PLoS One 12:e0181463
Long, Reggy; Hernaez, Mikel; Ochoa, Idoia et al. (2017) GeneComp, a new reference-based compressor for SAM files. Proc Data Compress Conf 2017:330-339
Dau, Hoang; Milenkovic, Olgica (2017) Latent Network Features and Overlapping Community Discovery via Boolean Intersection Representations. IEEE ACM Trans Netw 25:3219-3234
Pavlichin, Dmitri S; Ingber, Amir; Weissman, Tsachy (2017) Compressing Tabular Data via Pairwise Dependencies. Proc Data Compress Conf 2017:455
Ochoa, Idoia; Hernaez, Mikel; Goldfeder, Rachel et al. (2017) Effect of lossy compression of quality scores on variant calling. Brief Bioinform 18:183-194
Steiner, Fabian; Dempfle, Steffen; Ingber, Amir et al. (2016) Compression for Quadratic Similarity Queries: Finite Blocklength and Practical Schemes. IEEE Trans Inf Theory 62:2737-2747
Deorowicz, Sebastian; Grabowski, Szymon; Ochoa, Idoia et al. (2016) Comment on: 'ERGC: an efficient referential genome compression algorithm'. Bioinformatics 32:1115-1117
Ochoa, I; No, A; Hernaez, M et al. (2016) CROMqs: an infinitesimal successive refinement lossy compressor for the quality scores. Proc Inf Theory Workshop 2016:121-125
Han, Yanjun; Jiao, Jiantao; Weissman, Tsachy (2016) Minimax Rate-optimal Estimation of KL Divergence between Discrete Distributions. Int Symp Inf Theory Appl 2016:256-260
Showing the most recent 10 out of 26 publications