Big data analytics in HIV/AIDS research is critically underdeveloped. The decreasing cost of sequencing has stimulated the development of novel software tools and analysis frameworks. The bulk of these efforts has been driven by truly expansive (and well-funded) collaborative projects such as the 1000 Genomes Project, ENCODE, modENCODE, GTEx, the Human Microbiome Project, The Cancer Genome Atlas, and others. While these projects hardened many aspects of next-generation sequencing (NGS) data analysis and manipulation and established standards for data representation (e.g., the BAM, VCF, and CRAM formats), they faced a set of challenges markedly distinct from those faced by HIV researchers: long, stable genomes with few mutations (i.e., human) versus short, variable genomes with many mutations (i.e., HIV). Consequently, the development of HIV-specific tools and applications for NGS has largely been the domain of individual labs independently designing sensible, yet ad hoc and disaggregated, solutions to common problems. The result is a fragmented field, largely without accepted standards, with gaps between available solutions and the needs of end users. The current practice of writing "full-stack" custom in-house solutions for NGS analyses is not scalable, is not maintainable, largely fails to leverage developments from other domains of NGS data analysis, and hampers the adoption of this transformative technology in HIV research.
The specific aims of this proposal address practical aspects of HIV/AIDS-related NGS analysis by assembling proven and newly developed tools and modules into a series of "data to answer" workflows, and by creating a publicly available and accessible turnkey solution suitable for a large proportion of HIV/AIDS researchers who need to perform routine and bespoke analyses of NGS data.

Public Health Relevance

This proposal brings together three research groups (Penn State, Temple, and Johns Hopkins) with combined expertise and accomplishments in comprehensive sequence-based HIV/AIDS research, open source high-throughput genomic tools and framework development, reproducible scientific computation, and data visualization. Jointly, we will develop a single-point, publicly accessible informatics and analytical "data-to-answer" framework enabling HIV researchers to process, store, and retrieve NGS data, and to translate these data into actionable results.

Agency
National Institutes of Health (NIH)
Institute
National Institute of Allergy and Infectious Diseases (NIAID)
Type
Research Project (R01)
Project #
5R01AI134384-02
Application #
9511742
Study Section
Special Emphasis Panel (ZRG1)
Program Officer
Gezmu, Misrak
Project Start
2017-06-26
Project End
2022-05-31
Budget Start
2018-06-01
Budget End
2019-05-31
Support Year
2
Fiscal Year
2018
Total Cost
Indirect Cost
Name
Pennsylvania State University
Department
Biochemistry
Type
Schools of Arts and Sciences
DUNS #
003403953
City
University Park
State
PA
Country
United States
Zip Code
16802
Batut, Bérénice; Hiltemann, Saskia; Bagnacani, Andrea et al. (2018) Community-Driven Data Analysis Training for Biology. Cell Syst 6:752-758.e1
Frost, Simon D W; Magalis, Brittany Rife; Kosakovsky Pond, Sergei L (2018) Neutral Theory and Rapidly Evolving Viral Pathogens. Mol Biol Evol 35:1348-1354
Shank, Stephen D; Weaver, Steven; Kosakovsky Pond, Sergei L (2018) phylotree.js - a JavaScript library for application development and interactive data visualization in phylogenetics. BMC Bioinformatics 19:276
Grüning, Björn; Chilton, John; Köster, Johannes et al. (2018) Practical Computational Reproducibility in the Life Sciences. Cell Syst 6:631-635
Nekrutenko, Anton; Team, Galaxy; Goecks, Jeremy et al. (2018) Biology Needs Evolutionary Software Tools: Let's Build Them Right. Mol Biol Evol 35:1372-1375
Kosakovsky Pond, Sergei L; Weaver, Steven; Leigh Brown, Andrew J et al. (2018) HIV-TRACE (TRAnsmission Cluster Engine): a Tool for Large Scale Molecular Epidemiology of HIV-1 and Other Rapidly Evolving Pathogens. Mol Biol Evol 35:1812-1819