The long-term objective of this project is to develop and deploy efficient data structures and algorithms for the storage, transmission, querying, privacy protection, and management of large-scale High-Throughput Sequencing (HTS) and genomic information. HTS technologies are driving profound, revolutionary changes in biology and medicine. They are becoming the tool of choice for addressing fundamental questions in biology, from evolution to gene regulation, and for providing the foundation for the personalized medicine of tomorrow, as it becomes possible to cheaply resequence individual genomes. A project to sequence 1,000 human genomes is well underway, and soon it will be possible to sequence a human genome for less than $1,000. In addition to the obvious challenges of understanding the structure, function, and evolution of genomes, modern high-throughput genome sequencing methods also raise questions about how to efficiently represent, store, transmit, query, and protect the privacy of genomic sequence information. Currently, HTS and genome data are typically stored in a flat-text file format, which is inefficient not only in terms of storage capacity and communication bandwidth, but also in terms of information extraction and security. The proposed effort aims to remove this fundamental bottleneck and address the genomic data deluge by: (1) developing efficient data structures and compression algorithms for HTS and genomic data that also support rapid extraction and protection of genomic information, with compression factors for genomic data of 1,000 and beyond; (2) developing security and privacy-preserving algorithms and protocols to protect genomic data; (3) implementing and testing these data structures, protocols, and algorithms on a variety of data, including HTS data from different HTS technologies (e.g., Solexa, SOLiD, 454), individual human mitochondrial genome data (e.g., the MITOMAP database), individual human SNP data (e.g., dbSNP), and individual human genome data (e.g., the 1000 Genomes Project); and (4) validating and deploying the technology through multiple channels, from publications, to Web servers, to distribution of optimized software, to collaborations with life scientists, HTS companies, and large sequencing centers.
Sequencing individual human genomes will soon be affordable. This proposal seeks to develop efficient computer methods for storing, communicating, querying, and protecting genomic information to support the personalized medicine of the future, where all health-related decisions will take into account the particular genetic makeup of each individual.