We will develop methods and maintain software that make it radically easier for biomedical researchers to use and understand sequencing data. The project will support the maintenance and improvement of our popular "upstream" tools for analyzing sequencing data. These include the Bowtie and Bowtie 2 tools for read alignment, the Kraken 2 tool for metagenomics classification, and the Dashing tool for genomic sketching and comparison. We will also develop new systems that allow researchers to use these same core tools (Bowtie, Kraken 2, Dashing) to rapidly discover and vet archived datasets. We will enable researchers to quickly ascertain whether a dataset is of high quality, what species are present, whether contaminants are present, what assay was performed, which datasets are similar to each other, and which datasets are inconsistent with their annotated metadata. In this way, researchers can distill the relevant archived datasets, those having the expected biological properties, without depending on the accuracy of the associated metadata. Finally, we will develop new infrastructure for large-scale reanalysis and indexing of archived data, ultimately yielding new "search engines" for scientific question-answering. In particular, we will extend our past work on Rail-RNA, recount2 and Snaptron so that we can more effectively analyze huge collections of archived data, converting them into a variety of useful summary forms and then adding a layer of indexing so that users can query the summaries in the context of a scientific investigation. We will also create new catalogs and mechanisms whereby researchers can share their archive-assisted study designs, so that useful combinations of archived datasets, and insights into where their metadata might be incorrect or incomplete, can be reported and shared.

Public Health Relevance

Many researchers use DNA sequencing to study disease and biology, and analyzing this data requires sophisticated software capable of piecing together puzzles made of billions of fragments of DNA. Meanwhile, public archives are filling with huge datasets that could be used in everyday research, but they are not organized in a convenient way. We propose a set of projects that allow everyday scientists to easily (a) analyze sequencing data, (b) examine archived sequencing datasets to find those most relevant to their research, and (c) use a "search engine" to answer scientific questions with respect to the archive.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Unknown (R35)
Study Section
Special Emphasis Panel (ZRG1)
Program Officer
Ravichandran, Veerasamy
Johns Hopkins University
Biostatistics & Other Math Sci
Biomed Engr/Col Engr/Engr Sta
United States