The long-term objective of this project is to develop and deploy efficient data structures and algorithms for the storage, transmission, querying, privacy protection, and management of large-scale High-Throughput Sequencing (HTS) and genomic information. HTS technologies are driving profound, revolutionary changes in biology and medicine. They are becoming the tool of choice for addressing fundamental questions in biology, from evolution to gene regulation, and for providing the foundation for the personalized medicine of tomorrow, as it becomes possible to cheaply resequence individual genomes. A project to sequence 1,000 human genomes is well underway, and soon it will be possible to sequence a human genome for less than $1,000. In addition to the obvious challenges of understanding the structure, function, and evolution of genomes, modern high-throughput genome sequencing methods also raise questions about how to efficiently represent, store, transmit, query, and protect the privacy of genomic sequence information. Currently, HTS and genome data are typically stored in a flat-text file format, which is inefficient not only in terms of storage capacity and communication bandwidth, but also in terms of information extraction and security. The proposed effort aims at removing this fundamental bottleneck and addressing the genomic data deluge by: (1) developing efficient data structures and compression algorithms for HTS and genomic data that also support rapid extraction and protection of genomic information, with compression factors for genomic data in the range of 1,000 and beyond; (2) developing security and privacy-preserving algorithms and protocols to protect genomic data; (3) implementing and testing these data structures, protocols, and algorithms on a variety of data, including HTS data from different HTS technologies (e.g. Solexa, SOLiD, 454), individual human mitochondrial genome data (e.g. the MITOMAP database), individual human SNP data (e.g. dbSNP), and individual human genome data (e.g. the 1000 Genomes Project); and (4) validating and deploying the technology through multiple channels, from publications, to Web servers, to distribution of optimized software, to collaborations with life scientists, HTS companies, and large sequencing centers.

Public Health Relevance

Sequencing individual human genomes will soon be affordable. This proposal seeks to develop efficient computer methods for storing, communicating, querying, and protecting genomic information to support the personalized medicine of the future, where all health-related decisions will take into account the particular genetic makeup of each individual.

National Institutes of Health (NIH)
National Library of Medicine (NLM)
Research Project (R01)
Study Section
Biomedical Library and Informatics Review Committee (BLR)
Program Officer
Ye, Jane
University of California Irvine
Other Domestic Higher Education
United States