Data Discovery: Computational Methods for Searching Short-Read Sequencing Experiments

Kingsford, Carleton

Abstract

This proposal aims to solve the sequencing experiment discovery problem. The data from hundreds of thousands of short-read sequencing experiments are now publicly available, and private collections of sequencing experiments are also growing rapidly. These experiments include hundreds of thousands of whole genome sequencing experiments, and tens of thousands of RNA-seq, metagenomic, and tumor sequencing samples. However, these experiments are vastly underused, with few analyses making use of more than a handful of experiments at a time and most analyses ignoring this collection of raw data entirely. One crucial reason for this is that merely finding the appropriate experiments is a significant barrier to their use in downstream analyses. This is due to the lack of a computational platform that can search for relevant short-read sequencing data sets by the sequences they contain. It is not currently possible to find all the metagenomic experiments in which the genes that form a particular pathway are present or to find all experiments in which a novel lncRNA is observed. The experiment discovery problem is that of finding, on a global scale, those experiments that are relevant to an isoform, variant, or species under study. By building on our existing work in large-scale sequence search, we propose to develop a new distributed platform to index and search hundreds of thousands of raw short-read sequencing data sets to enable researchers to quickly find experiments that contain their query sequences. We will apply this system to searching RNA-seq, metagenomic, and cancer tumor samples. The research questions we will solve include how to improve the computational scaling, increase the types of biologically meaningful queries that can be answered, and increase our ability to find relevant experiments in situations where mutations are common. We will produce a high-quality open-source implementation of the developed computational methods.
The project will significantly expand the usefulness of large repositories of raw sequencing reads and enable new approaches for large-scale reanalysis and reuse of short-read experiments. The system will unlock a rich source of biological information for gene function prediction, for understanding microbial communities, and for connecting genetic variation with disease progression.
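
As a rough illustration of what searching experiments "by the sequences they contain" could look like, the sketch below indexes each experiment by the set of k-mers (length-k substrings) in its reads and reports experiments that contain at least a chosen fraction of the query's k-mers. This is a toy under stated assumptions, not the method proposed in this grant: the k-mer representation, the function names (kmers, build_index, search), the threshold parameter theta, and the example reads are all hypothetical and chosen only for illustration. A real system would need a compressed, distributed index rather than in-memory Python sets.

```python
from typing import Dict, List, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq (uppercased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(experiments: Dict[str, List[str]], k: int) -> Dict[str, Set[str]]:
    """Map each experiment name to the union of k-mers over all its reads."""
    index: Dict[str, Set[str]] = {}
    for name, reads in experiments.items():
        present: Set[str] = set()
        for read in reads:
            present |= kmers(read, k)
        index[name] = present
    return index


def search(index: Dict[str, Set[str]], query: str, k: int,
           theta: float = 0.8) -> List[Tuple[str, float]]:
    """Return (experiment, fraction) pairs where at least a fraction theta
    of the query's distinct k-mers occur in the experiment, best hit first."""
    query_kmers = kmers(query, k)
    if not query_kmers:  # query shorter than k: nothing to match
        return []
    hits = []
    for name, present in index.items():
        fraction = len(query_kmers & present) / len(query_kmers)
        if fraction >= theta:
            hits.append((name, fraction))
    # Sort by descending fraction, then by name for a stable order.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Hypothetical toy data: two "experiments", each a short list of reads.
experiments = {
    "exp_A": ["ACGTACGTGA", "GTGACCTTAG"],
    "exp_B": ["TTTTTTTTTT", "ACGTAAAAAA"],
}
index = build_index(experiments, k=4)
query = "ACGTACGTGACC"

print(search(index, query, k=4, theta=0.8))
```

Lowering theta trades precision for tolerance to sequencing errors and mutations, which is one simple way to think about the abstract's goal of finding relevant experiments where mutations are common.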

Public Health Relevance

Enormous amounts of genomic, metagenomic, and transcriptomic sequence data are being collected for basic science and to inform healthcare treatment, and much of this data is publicly available or available to be shared upon request. However, computational systems for finding relevant experiments have lagged behind this data generation. We propose to develop a comprehensive, search-by-sequence experiment discovery platform that allows researchers to share, search, and find experiments by expressed or present sequences and variants in order to facilitate data sharing, data reproducibility, and biomedical discoveries.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Research Project (R01)
Project #: 5R01GM122935-04
Application #: 9944630
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Ravichandran, Veerasamy

Project Start: 2017-05-01
Project End: 2021-04-30
Budget Start: 2020-05-01
Budget End: 2021-04-30
Support Year: 4
Fiscal Year: 2020

Institution

Name: Carnegie-Mellon University
Department: Biostatistics & Other Math Sci
Type: Schools of Arts and Sciences
DUNS #: 052184116

City: Pittsburgh
State: PA
Country: United States
Zip Code: 15213

Related projects


NIH 2020 R01 GM	Data Discovery: Computational Methods for Searching Short-Read Sequencing Experiments Kingsford, Carleton Lee / Carnegie-Mellon University
NIH 2019 R01 GM	Data Discovery: Computational Methods for Searching Short-Read Sequencing Experiments Kingsford, Carleton Lee / Carnegie-Mellon University
NIH 2018 R01 GM	Data Discovery: Computational Methods for Searching Short-Read Sequencing Experiments Kingsford, Carleton Lee / Carnegie-Mellon University
NIH 2017 R01 GM	Data Discovery: Computational Methods for Searching Short-Read Sequencing Experiments Kingsford, Carleton Lee / Carnegie-Mellon University

Publications

Wang, Hao; Kingsford, Carl; McManus, C Joel (2018) Using the Ribodeblur pipeline to recover A-sites from yeast ribosome profiling data. Methods 137:67-70

Lee, Heewook; Kingsford, Carl (2018) Accurate Assembly and Typing of HLA using a Graph-Guided Assembler Kourami. Methods Mol Biol 1802:235-247

Sauerwald, Natalie; Kingsford, Carl (2018) Quantifying the similarity of topological domains across normal and cancer human cell types. Bioinformatics 34:i475-i483

Shao, Mingfu; Kingsford, Carl (2017) Accurate assembly of transcripts through phase-preserving graph decomposition. Nat Biotechnol 35:1167-1169

Marçais, Guillaume; Pellow, David; Bork, Daniel et al. (2017) Improving the performance of minimizers and winnowing schemes. Bioinformatics 33:i110-i117
