Dynamically scalable accessible analysis for next generation sequence data

Taylor, James; Nekrutenko, Anton

Abstract

Wide availability of "next-generation" sequencing (NGS) instruments has enabled any investigator, for a modest cost, to produce enormous amounts of DNA sequence data. However, working with these raw sequences presents significant problems for individual investigators, small labs, or core facilities. For an experimental group with no computational expertise, simply running a data analysis program is a barrier, let alone building a compute and data storage infrastructure capable of dealing with NGS data. Fortunately, a computational model - "cloud computing" - has recently emerged and is ideally suited to the analysis of large-scale sequence data. In this model, computation and storage exist as virtual resources, which can be dynamically allocated and released as needed. Importantly, cloud resources can provide storage and computation at far less cost than dedicated resources for certain use cases. However, formidable challenges need to be addressed to make these resources available to individual investigators. Specifically, although cloud computing provides a way to acquire computational resources on demand, the resources provided are either virtual machines on the Internet or specific programming libraries, which are unusable for experimentalists. Thus, a viable analysis solution needs to be accessible and deployable without informatics expertise; it must efficiently and automatically use dynamically scalable resources, while taking into account time and cost; and it must include appropriate analysis tools and easily support addition of new tools as they emerge. We have previously developed a software system - Galaxy (http://galaxyproject.org) - that provides a robust framework for addressing these needs. Here we propose to significantly extend this framework to allow any experimentalist to perform large-scale NGS analyses utilizing the power of cloud computing infrastructure.
In particular, we will modify the existing Galaxy framework to run entirely within the cloud. We will adapt the way Galaxy schedules and executes jobs to make effective use of cloud-style resources. We will provide a mechanism for individual users to create and deploy custom Galaxy instances on a cloud through an entirely web-based interface. Finally, we will test our approach by applying the developed facilities to the existing human re-sequencing data in order to uncover hidden patterns of mutations causing human genetic disease on a very large scale.

Public Health Relevance

Increasingly available and inexpensive high-throughput DNA sequencing holds great promise for biomedical research, but informatics challenges block the full realization of the potential of this transformative technology. In particular, progress is limited by the informatics and engineering expertise of biomedical researchers, and the availability of sufficient computational infrastructure to analyze these enormous datasets. This project will address these problems by bringing together Galaxy, a system for making complex computational analysis accessible and reproducible, with "cloud computing", an infrastructure model where computing resources are purchased on demand as needed, making it possible for investigators with no informatics expertise to perform data-intensive analysis using cloud resources.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: High Impact Research and Research Infrastructure Programs (RC2)
Project #: 5RC2HG005542-02
Application #: 7937844
Study Section: Special Emphasis Panel (ZHG1-HGR-N (O1))
Program Officer: Bonazzi, Vivien

Project Start: 2009-09-25
Project End: 2012-07-31
Budget Start: 2010-08-01
Budget End: 2012-07-31
Support Year: 2
Fiscal Year: 2010
Total Cost: $734,945
Indirect Cost

Institution

Name: Emory University
Department: Biology
Type: Schools of Arts and Sciences
DUNS #: 066469933

City: Atlanta
State: GA
Country: United States
Zip Code: 30322

Related projects


NIH 2010 RC2 HG	Dynamically scalable accessible analysis for next generation sequence data Taylor, James Peter; Nekrutenko, Anton / Emory University	$734,945
NIH 2009 RC2 HG	Dynamically scalable accessible analysis for next generation sequence data Taylor, James Peter; Nekrutenko, Anton / Emory University	$780,798

Publications

Børnich, Claus; Grytten, Ivar; Hovig, Eivind et al. (2016) Galaxy Portal: interacting with the galaxy platform through mobile devices. Bioinformatics 32:1743-5

Stoler, Nicholas; Arbeithuber, Barbara; Guiblet, Wilfried et al. (2016) Streamlined analysis of duplex sequencing data with Du Novo. Genome Biol 17:180

Afgan, Enis; Baker, Dannon; van den Beek, Marius et al. (2016) The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update. Nucleic Acids Res 44:W3-W10

Goecks, Jeremy; El-Rayes, Bassel F; Maithel, Shishir K et al. (2015) Open pipelines for integrated tumor genome profiles reveal differences between pancreatic cancer tumors and cell lines. Cancer Med 4:392-403

Harris, Nomi L; Cock, Peter J A; Chapman, Brad A et al. (2015) The Bioinformatics Open Source Conference (BOSC) 2013. Bioinformatics 31:299-300

Blankenberg, Daniel; Taylor, James; Nekrutenko, Anton (2015) Online resources for genomic analysis using high-throughput sequencing. Cold Spring Harb Protoc 2015:324-35

Sauria, Michael E G; Phillips-Cremins, Jennifer E; Corces, Victor G et al. (2015) HiFive: a tool suite for easy and efficient HiC and 5C data analysis. Genome Biol 16:237

Budd, Aidan; Corpas, Manuel; Brazas, Michelle D et al. (2015) A quick guide for building a successful bioinformatics community. PLoS Comput Biol 11:e1003972

Blankenberg, Daniel; Von Kuster, Gregory; Bouvier, Emil et al. (2014) Dissemination of scientific software with Galaxy ToolShed. Genome Biol 15:403

Blankenberg, Daniel; Johnson, James E; Galaxy Team et al. (2014) Wrangling Galaxy's reference data. Bioinformatics 30:1917-9

Showing the most recent 10 out of 23 publications
