The long-term objective of this project is to develop and deploy efficient data structures and algorithms for the storage, transmission, querying, privacy protection, and management of large-scale High-Throughput Sequencing (HTS) and genomic information. HTS technologies are driving profound, revolutionary changes in biology and medicine. They are becoming the tool of choice for addressing fundamental questions in biology, from evolution to gene regulation, and for providing the foundation for the personalized medicine of tomorrow, as it becomes possible to cheaply resequence individual genomes. A project to sequence 1,000 human genomes is well underway, and soon it will be possible to sequence a human genome for less than $1,000. In addition to the obvious challenges of understanding the structure, function, and evolution of genomes, modern high-throughput genome sequencing methods also raise questions about how to efficiently represent, store, transmit, query, and protect the privacy of genomic sequence information. Currently, HTS and genome data are typically stored in a flat text file format, which is inefficient not only in terms of storage capacity and communication bandwidth, but also in terms of information extraction and security. The proposed effort aims to remove this fundamental bottleneck and address the genomic data deluge by: (1) Developing efficient data structures and compression algorithms for HTS and genomic data that also support rapid extraction and protection of genomic information, with compression factors for genomic data of 1,000 and beyond; (2) Developing security and privacy-preserving algorithms and protocols to protect genomic data; (3) Implementing and testing these data structures, protocols, and algorithms on a variety of data, including HTS data from different HTS technologies (e.g., Solexa, SOLiD, 454), individual human mitochondrial genome data (e.g., the MITOMAP database), individual human SNP data (e.g., dbSNP), and individual human genome data (e.g., the 1000 Genomes Project); and (4) Validating and deploying the technology through multiple channels, from publications, to Web servers, to distribution of optimized software, to collaborations with life scientists, HTS companies, and large sequencing centers.

Public Health Relevance

Sequencing individual human genomes will soon be affordable. This proposal seeks to develop efficient computer methods for storing, communicating, querying, and protecting genomic information to support the personalized medicine of the future, where all health-related decisions will take into account the particular genetic makeup of each individual.

National Institutes of Health (NIH)
National Library of Medicine (NLM)
Research Project (R01)
Study Section: Biomedical Library and Informatics Review Committee (BLR)
Program Officer: Ye, Jane
University of California Irvine
Other Domestic Higher Education
United States
Magnan, Christophe N; Baldi, Pierre (2014) SSpro/ACCpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity. Bioinformatics 30:2592-7
Nagata, Ken; Randall, Arlo; Baldi, Pierre (2014) Incorporating post-translational modifications and unnatural amino acids into high-throughput modeling of protein structures. Bioinformatics 30:1681-9
Masri, Selma; Rigor, Paul; Cervantes, Marlene et al. (2014) Partitioning circadian transcription by SIRT6 leads to segregated control of cellular metabolism. Cell 158:659-72
Patel, V R; Eckel-Mahan, K; Sassone-Corsi, P et al. (2014) How pervasive are circadian oscillations? Trends Cell Biol 24:329-31
Baldi, Pierre; Sadowski, Peter (2014) The Dropout Learning Algorithm. Artif Intell 210:78-122
Goodrich, Michael T (2014) Spin-the-bottle Sort and Annealing Sort: Oblivious Sorting via Round-robin Random Comparisons. Algorithmica 68:835-858
Bellet, Marina M; Deriu, Elisa; Liu, Janet Z et al. (2013) Circadian clock regulates the host response to Salmonella. Proc Natl Acad Sci U S A 110:9897-902
Feher, Victoria A; Randall, Arlo; Baldi, Pierre et al. (2013) A 3-dimensional trimeric β-barrel model for Chlamydia MOMP contains conserved and novel elements of Gram-negative bacterial porins. PLoS One 8:e68934
Fujikawa, Teppei; Berglund, Eric D; Patel, Vishal R et al. (2013) Leptin engages a hypothalamic neurocircuitry to permit survival in the absence of insulin. Cell Metab 18:431-44
Chang, Ivan; Baldi, Pierre (2013) A unifying kinetic framework for modeling oxidoreductase-catalyzed reactions. Bioinformatics 29:1299-307
