Efficient Data Structures and Algorithms for Genomics Sequence Data

Baldi, Pierre

Abstract

The long-term objective of this project is to develop and deploy efficient data structures and algorithms for the storage, transmission, querying, privacy protection, and management of large-scale High-Throughput Sequencing (HTS) and genomic information. HTS technologies are driving profound, revolutionary changes in biology and medicine. They are becoming the tool of choice for addressing fundamental questions in biology, from evolution to gene regulation, and for providing the foundation for the personalized medicine of tomorrow, as it becomes possible to cheaply resequence individual genomes. A project to sequence 1,000 human genomes is well underway, and soon it will be possible to sequence a human genome for less than $1,000. In addition to the obvious challenges of understanding the structure, function, and evolution of genomes, modern high-throughput genome sequencing methods also raise questions about how to efficiently represent, store, transmit, query, and protect the privacy of genomic sequence information. Currently, HTS and genome data are typically stored in a flat-text file format, which is inefficient not only in terms of storage capacity and communication bandwidth, but also in terms of information extraction and security. The proposed effort aims to remove this fundamental bottleneck and address the genomic data deluge by: (1) developing efficient data structures and compression algorithms for HTS and genomic data that also support rapid extraction and protection of genomic information, with compression factors for genomic data in the range of 1,000 and beyond; (2) developing security and privacy-preserving algorithms and protocols to protect genomic data; (3) implementing and testing these data structures, protocols, and algorithms on a variety of data, including HTS data from different HTS technologies (e.g. Solexa, SOLiD, 454), individual human mitochondrial genome data (e.g. the MITOMAP database), individual human SNP data (e.g. dbSNP), and individual human genome data (e.g. the 1000 Genomes Project); and (4) validating and deploying the technology through multiple channels, from publications, to Web servers, to distribution of optimized software, to collaborations with life scientists, HTS companies, and large sequencing centers.
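To make the storage argument concrete, the following is a minimal Python sketch of two generic ideas often used to shrink sequence data relative to flat text: packing each base into 2 bits, and storing an individual genome only as its differences from a reference. It is an illustration under stated assumptions, not the project's actual data structures. All function names are hypothetical, the alphabet is assumed to be exactly A, C, G, T, and the reference-based part handles substitutions only (no insertions or deletions).

```python
# Illustrative sketch only; not the method developed under this grant.
# Assumptions: sequences contain only A, C, G, T; variants are substitutions.

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_2bit(seq: str) -> bytes:
    """Pack a base string into 2 bits per base (4 bases per byte)."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack_2bit(data: bytes, length: int) -> str:
    """Invert pack_2bit; the original length must be stored separately."""
    return "".join(
        BASES[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )


def diff_against_reference(ref: str, sample: str) -> list:
    """Return (position, base) pairs where the sample differs from the reference."""
    if len(ref) != len(sample):
        raise ValueError("this sketch handles substitutions only")
    return [(i, b) for i, (a, b) in enumerate(zip(ref, sample)) if a != b]


def apply_diff(ref: str, diff: list) -> str:
    """Rebuild the sample from the reference plus its recorded substitutions."""
    seq = list(ref)
    for pos, base in diff:
        seq[pos] = base
    return "".join(seq)


if __name__ == "__main__":
    reference = "ACGT" * 250                       # 1,000 bases as flat text
    sample = reference[:10] + "T" + reference[11:]  # one substitution at index 10
    packed = pack_2bit(sample)
    diff = diff_against_reference(reference, sample)
    print(len(sample), len(packed), diff)
```

In this toy case the packed form takes a quarter of the bytes of the flat text, while the reference-based form stores a single (position, base) pair. Savings on real data depend on how closely a sample matches the chosen reference and on how read metadata and quality scores are handled, which this sketch ignores.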

Public Health Relevance

Sequencing individual human genomes will soon be affordable. This proposal seeks to develop efficient computer methods for storing, communicating, querying, and protecting genomic information to support the personalized medicine of the future, where all health-related decisions will take into account the particular genetic makeup of each individual.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Research Project (R01)
Project #: 5R01LM010235-02
Application #: 8138561
Study Section: Biomedical Library and Informatics Review Committee (BLR)
Program Officer: Ye, Jane

Project Start: 2010-09-30
Project End: 2013-09-29
Budget Start: 2011-09-30
Budget End: 2012-09-29
Support Year: 2
Fiscal Year: 2011
Total Cost: $167,930

Institution

Name: University of California Irvine
Type: Other Domestic Higher Education
DUNS #: 046705849

City: Irvine
State: CA
Country: United States
Zip Code: 92697

Related projects


NIH 2012 R01 LM	Efficient Data Structures and Algorithms for Genomics Sequence Data	Baldi, Pierre / University of California Irvine	$162,422
NIH 2011 R01 LM	Efficient Data Structures and Algorithms for Genomics Sequence Data	Baldi, Pierre / University of California Irvine	$167,930
NIH 2010 R01 LM	Efficient Data Structures and Algorithms for Genomics Sequence Data	Baldi, Pierre / University of California Irvine	$176,932

Publications

Ioris, Rafael M; Galié, Mirco; Ramadori, Giorgio et al. (2017) SIRT6 Suppresses Cancer Stem-like Capacity in Tumors with PI3K Activation Independently of Its Deacetylase Activity. Cell Rep 18:1858-1868

Borrego, Stacey L; Fahrmann, Johannes; Datta, Rupsa et al. (2016) Metabolic changes associated with methionine stress sensitivity in MDA-MB-468 breast cancer cells. Cancer Metab 4:9

Masri, Selma; Papagiannakopoulos, Thales; Kinouchi, Kenichiro et al. (2016) Lung Adenocarcinoma Distally Rewires Hepatic Circadian Homeostasis. Cell 165:896-909

Biehl, Michael; Sadowski, Peter; Bhanot, Gyan et al. (2015) Inter-species prediction of protein phosphorylation in the sbv IMPROVER species translation challenge. Bioinformatics 31:453-61

Lusci, Alessandro; Browning, Michael; Fooshee, David et al. (2015) Accurate and efficient target prediction using a potency-sensitive influence-relevance voter. J Cheminform 7:63

Gordon, William M; Zeller, Michael D; Klein, Rachel H et al. (2014) A GRHL3-regulated repair pathway suppresses immune-mediated epidermal hyperplasia. J Clin Invest 124:5205-18

Baldi, Pierre; Sadowski, Peter (2014) The Dropout Learning Algorithm. Artif Intell 210:78-122

Goodrich, Michael T (2014) Spin-the-bottle Sort and Annealing Sort: Oblivious Sorting via Round-robin Random Comparisons. Algorithmica 68:835-858

Nagata, Ken; Randall, Arlo; Baldi, Pierre (2014) Incorporating post-translational modifications and unnatural amino acids into high-throughput modeling of protein structures. Bioinformatics 30:1681-9

Masri, Selma; Rigor, Paul; Cervantes, Marlene et al. (2014) Partitioning circadian transcription by SIRT6 leads to segregated control of cellular metabolism. Cell 158:659-72

Showing the most recent 10 out of 39 publications
