The long-term objective of this project is to develop and deploy efficient data structures and algorithms for the storage, transmission, querying, privacy protection, and management of large-scale High-Throughput Sequencing (HTS) and genomic information. HTS technologies are driving profound, revolutionary changes in biology and medicine. They are becoming the tool of choice for addressing fundamental questions in biology, from evolution to gene regulation, and for providing the foundation for the personalized medicine of tomorrow, as it becomes possible to cheaply resequence individual genomes. A project to sequence 1,000 human genomes is well underway, and soon it will be possible to sequence a human genome for less than $1,000. In addition to the obvious challenges of understanding the structure, function, and evolution of genomes, modern high-throughput genome sequencing methods also raise questions about how to efficiently represent, store, transmit, query, and protect the privacy of genomic sequence information. Currently, HTS and genome data are typically stored in a flat-text file format, which is inefficient not only in terms of storage capacity and communication bandwidth, but also in terms of information extraction and security. The proposed effort aims to remove this fundamental bottleneck and address the genomic data deluge by: (1) developing efficient data structures and compression algorithms for HTS and genomic data that also support rapid extraction and protection of genomic information, with compression factors for genomic data in the range of 1,000 and beyond; (2) developing security and privacy-preserving algorithms and protocols to protect genomic data; (3) implementing and testing these data structures, protocols, and algorithms on a variety of data, including HTS data from different HTS technologies (e.g. Solexa, SOLiD, 454), individual human mitochondrial genome data (e.g. the MITOMAP database), individual human SNP data (e.g. dbSNP), and individual human genome data (e.g. the 1000 Genomes Project); and (4) validating and deploying the technology through multiple channels, from publications, to Web servers, to distribution of optimized software, to collaborations with life scientists, HTS companies, and large sequencing centers.

Public Health Relevance

Sequencing individual human genomes will soon be affordable. This proposal seeks to develop efficient computer methods for storing, communicating, querying, and protecting genomic information to support the personalized medicine of the future, where all health-related decisions will take into account the particular genetic makeup of each individual.

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Research Project (R01)
Project #: 5R01LM010235-03
Application #: 8326115
Study Section: Biomedical Library and Informatics Review Committee (BLR)
Program Officer: Ye, Jane
Project Start: 2010-09-30
Project End: 2014-09-29
Budget Start: 2012-09-30
Budget End: 2014-09-29
Support Year: 3
Fiscal Year: 2012
Total Cost: $162,422
Indirect Cost: $44,822

Name: University of California Irvine
Department:
Type: Other Domestic Higher Education
DUNS #: 046705849
City: Irvine
State: CA
Country: United States
Zip Code: 92697

Ioris, Rafael M; Galié, Mirco; Ramadori, Giorgio et al. (2017) SIRT6 Suppresses Cancer Stem-like Capacity in Tumors with PI3K Activation Independently of Its Deacetylase Activity. Cell Rep 18:1858-1868
Borrego, Stacey L; Fahrmann, Johannes; Datta, Rupsa et al. (2016) Metabolic changes associated with methionine stress sensitivity in MDA-MB-468 breast cancer cells. Cancer Metab 4:9
Masri, Selma; Papagiannakopoulos, Thales; Kinouchi, Kenichiro et al. (2016) Lung Adenocarcinoma Distally Rewires Hepatic Circadian Homeostasis. Cell 165:896-909
Biehl, Michael; Sadowski, Peter; Bhanot, Gyan et al. (2015) Inter-species prediction of protein phosphorylation in the sbv IMPROVER species translation challenge. Bioinformatics 31:453-61
Lusci, Alessandro; Browning, Michael; Fooshee, David et al. (2015) Accurate and efficient target prediction using a potency-sensitive influence-relevance voter. J Cheminform 7:63
Masri, Selma; Rigor, Paul; Cervantes, Marlene et al. (2014) Partitioning circadian transcription by SIRT6 leads to segregated control of cellular metabolism. Cell 158:659-72
Magnan, Christophe N; Baldi, Pierre (2014) SSpro/ACCpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity. Bioinformatics 30:2592-7
Patel, V R; Eckel-Mahan, K; Sassone-Corsi, P et al. (2014) How pervasive are circadian oscillations? Trends Cell Biol 24:329-31
Gordon, William M; Zeller, Michael D; Klein, Rachel H et al. (2014) A GRHL3-regulated repair pathway suppresses immune-mediated epidermal hyperplasia. J Clin Invest 124:5205-18
Baldi, Pierre; Sadowski, Peter (2014) The Dropout Learning Algorithm. Artif Intell 210:78-122

Showing the most recent 10 out of 39 publications