We propose to create new software and analysis methods designed to make possible the exploration of a unique dataset, the 1,004 genomes sequenced by the Consortium on Asthma among African-Ancestry Populations in the Americas (CAAPA). The size of this dataset, over 130 Terabytes, currently prevents it from being explored with alignment-based tools, and researchers instead are limited to using the much smaller files containing single-nucleotide variants. Our proposed software will make this dataset and others like it available for real- time searching, a capability that is not yet possible for any genomic database of this size. Since the early 1990s, scientists have used DNA sequence databases to study a wide range of problems, including novel gene discovery, mutation detection, the investigation of larger structural variants, and evolutionary processes. The ability to search all known genes and genomes using BLAST and similar programs has long been assumed, and sequence search engines throughout the world provide this ability. However, the vast size of the CAAPA dataset makes it impossible to search the data itself using current tools. One cannot look for specific mutations, extract and re-analyze data for any particular gene or regulatory region, or look for structural variants. Newer, fast next-generation sequence alignment programs such as Bowtie, originally developed in our group, allow far faster alignment of NGS reads to the genome, but even these programs cannot search data on the scale of CAAPA in real time. Different architectures need to be designed and built to accommodate these very large datasets. The CAAPA exploration system (CESYS) will use a combination of a highly efficient database, very fast storage, and fast search algorithms to achieve our goals. This project aims to accomplish several goals that will dramatically enhance the value of CAAPA. First, the data will be made available to a very large community of researchers, who can use it not only to study the genetics of asthma and allergy in the CAAPA populations, but also to compare these subjects to other groups. The data currently resides on hard drives and is available only to a small number of the project's PIs, a situation that limits its value. Second, b creating an authentication system consistent with dbGaP, we will create a data sharing model that other projects can use and that will remove some of the technical barriers to sharing genome data from human subjects. Third, as part of building the database, we will re-call all the SNPs using the newly released human genome build (hg20), creating a consistent set of variants that we will also share freely through the project database. Fourth, we will identify all bacterial contaminants, including those in a subset of subjects known to have bloodstream infections at the time of sample collection. Fifth, we will identify structural variants unique to he CAAPA population, which we can then explore for any association with the risk of asthma.

Public Health Relevance

The CAAPA project has generated over 1000 genomes from a population of African-ancestry individuals living in the United States, Central and South America, and Africa. The data is so voluminous that scientists cannot download or search it, a problem that will only get worse as the database grows. This project will create new software, tools, and resources that make the CAAPA genome data searchable for the first time, allowing scientists to discover DNA sequences that might be responsible for traits and diseases for which this population has an unusually high risk.

Agency
National Institute of Health (NIH)
Institute
National Heart, Lung, and Blood Institute (NHLBI)
Type
Research Project (R01)
Project #
5R01HL129239-02
Application #
9096211
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Gan, Weiniu
Project Start
2015-07-01
Project End
2018-06-30
Budget Start
2016-07-01
Budget End
2017-06-30
Support Year
2
Fiscal Year
2016
Total Cost
Indirect Cost
Name
Johns Hopkins University
Department
Biomedical Engineering
Type
Schools of Medicine
DUNS #
001910777
City
Baltimore
State
MD
Country
United States
Zip Code
21205
Li, Zhigang; Breitwieser, Florian P; Lu, Jennifer et al. (2018) Identifying Corneal Infections in Formalin-Fixed Specimens Using Next Generation Sequencing. Invest Ophthalmol Vis Sci 59:280-288
Luo, Ruibang; Schatz, Michael C; Salzberg, Steven L (2017) 16GT: a fast and sensitive variant caller using a 16-genotype probabilistic model. Gigascience 6:1-4
Luo, Ruibang; Zimin, Aleksey; Workman, Rachael et al. (2017) First Draft Genome Sequence of the Pathogenic Fungus Lomentospora prolificans (Formerly Scedosporium prolificans). G3 (Bethesda) 7:3831-3836