A Software Framework for Exploring 1,000 Genomes of African Descent

Salzberg, Steven; Barnes, Kathleen

Abstract

We propose to create new software and analysis methods designed to make possible the exploration of a unique dataset, the 1,004 genomes sequenced by the Consortium on Asthma among African-Ancestry Populations in the Americas (CAAPA). The size of this dataset, over 130 terabytes, currently prevents it from being explored with alignment-based tools, and researchers are instead limited to using the much smaller files containing single-nucleotide variants. Our proposed software will make this dataset and others like it available for real-time searching, a capability that is not yet possible for any genomic database of this size.

Since the early 1990s, scientists have used DNA sequence databases to study a wide range of problems, including novel gene discovery, mutation detection, the investigation of larger structural variants, and evolutionary processes. The ability to search all known genes and genomes using BLAST and similar programs has long been assumed, and sequence search engines throughout the world provide this ability. However, the vast size of the CAAPA dataset makes it impossible to search the data itself using current tools. One cannot look for specific mutations, extract and re-analyze data for any particular gene or regulatory region, or look for structural variants. Newer next-generation sequencing (NGS) alignment programs such as Bowtie, originally developed in our group, allow far faster alignment of NGS reads to the genome, but even these programs cannot search data on the scale of CAAPA in real time. Different architectures need to be designed and built to accommodate these very large datasets. The CAAPA exploration system (CESYS) will use a combination of a highly efficient database, very fast storage, and fast search algorithms to achieve our goals.

This project aims to accomplish several goals that will dramatically enhance the value of CAAPA. First, the data will be made available to a very large community of researchers, who can use it not only to study the genetics of asthma and allergy in the CAAPA populations, but also to compare these subjects to other groups. The data currently resides on hard drives and is available only to a small number of the project's PIs, a situation that limits its value. Second, by creating an authentication system consistent with dbGaP, we will create a data sharing model that other projects can use and that will remove some of the technical barriers to sharing genome data from human subjects. Third, as part of building the database, we will re-call all the SNPs using the newly released human genome build (hg20), creating a consistent set of variants that we will also share freely through the project database. Fourth, we will identify all bacterial contaminants, including those in a subset of subjects known to have bloodstream infections at the time of sample collection. Fifth, we will identify structural variants unique to the CAAPA population, which we can then explore for any association with the risk of asthma.

Public Health Relevance

The CAAPA project has generated over 1,000 genomes from a population of African-ancestry individuals living in the United States, Central and South America, and Africa. The data is so voluminous that scientists cannot download or search it, a problem that will only get worse as the database grows. This project will create new software, tools, and resources that make the CAAPA genome data searchable for the first time, allowing scientists to discover DNA sequences that might be responsible for traits and diseases for which this population has an unusually high risk.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Heart, Lung, and Blood Institute (NHLBI)
Type: Research Project (R01)
Project #: 5R01HL129239-02
Application #: 9096211
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Gan, Weiniu

Project Start: 2015-07-01
Project End: 2018-06-30
Budget Start: 2016-07-01
Budget End: 2017-06-30
Support Year: 2
Fiscal Year: 2016

Institution

Name: Johns Hopkins University
Department: Biomedical Engineering
Type: Schools of Medicine
DUNS #: 001910777

City: Baltimore
State: MD
Country: United States
Zip Code: 21205

Related projects


NIH 2017 R01 HL	A Software Framework for Exploring 1,000 Genomes of African Descent Salzberg, Steven L.; Barnes, Kathleen C. / Johns Hopkins University
NIH 2016 R01 HL	A Software Framework for Exploring 1,000 Genomes of African Descent Salzberg, Steven L.; Barnes, Kathleen C. / Johns Hopkins University
NIH 2015 R01 HL	A Software Framework for Exploring 1,000 Genomes of African Descent Salzberg, Steven L.; Barnes, Kathleen C. / Johns Hopkins University

Publications

Li, Zhigang; Breitwieser, Florian P; Lu, Jennifer et al. (2018) Identifying Corneal Infections in Formalin-Fixed Specimens Using Next Generation Sequencing. Invest Ophthalmol Vis Sci 59:280-288

Luo, Ruibang; Schatz, Michael C; Salzberg, Steven L (2017) 16GT: a fast and sensitive variant caller using a 16-genotype probabilistic model. Gigascience 6:1-4

Luo, Ruibang; Zimin, Aleksey; Workman, Rachael et al. (2017) First Draft Genome Sequence of the Pathogenic Fungus Lomentospora prolificans (Formerly Scedosporium prolificans). G3 (Bethesda) 7:3831-3836
