Wide availability of "next-generation" sequencing (NGS) instruments has enabled any investigator, for a modest cost, to produce enormous amounts of DNA sequence data. However, working with these raw sequences presents significant problems for individual investigators, small labs, or core facilities. For an experimental group with no computational expertise, simply running a data analysis program is a barrier, let alone building a compute and data storage infrastructure capable of dealing with NGS data. Fortunately, a computational model - "cloud computing" - has recently emerged and is ideally suited to the analysis of large-scale sequence data. In this model, computation and storage exist as virtual resources, which can be dynamically allocated and released as needed. Importantly, cloud resources can provide storage and computation at far less cost than dedicated resources for certain use cases. However, formidable challenges need to be addressed to make these resources available to individual investigators. Specifically, although cloud computing provides a way to acquire computational resources on demand, the resources provided are either virtual machines on the Internet or specific programming libraries, which are unusable for experimentalists. Thus, a viable analysis solution needs to be accessible and deployable without informatics expertise; it must efficiently and automatically use dynamically scalable resources, while taking into account time and cost; it must include appropriate analysis tools and easily support addition of new tools as they emerge. We have previously developed a software system - Galaxy (http://galaxyproject.org) - that provides a robust framework for addressing these needs. Here we propose to significantly extend this framework to allow any experimentalist to perform large-scale NGS analyses utilizing the power of cloud computing infrastructure.
In particular, we will modify the existing Galaxy framework to run entirely within the cloud. We will adapt the way Galaxy schedules and executes jobs to make effective use of cloud-style resources. We will provide a mechanism for individual users to create and deploy custom Galaxy instances on a cloud through an entirely web-based interface. Finally, we will test our approach by applying the developed facilities to existing human resequencing data in order to uncover, on a very large scale, hidden patterns of mutations causing human genetic disease.

Public Health Relevance

Project Narrative: Increasingly available and inexpensive high-throughput DNA sequencing holds great promise for biomedical research, but informatics challenges block the full realization of the potential of this transformative technology. In particular, progress is limited by the informatics and engineering expertise of biomedical researchers and by the availability of sufficient computational infrastructure to analyze these enormous datasets. This project will address these problems by bringing together Galaxy, a system for making complex computational analysis accessible and reproducible, with "cloud computing", an infrastructure model where computing resources are purchased on demand as needed, making it possible for investigators with no informatics expertise to perform data-intensive analysis using cloud resources.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: High Impact Research and Research Infrastructure Programs (RC2)
Project #: 5RC2HG005542-02
Application #: 7937844
Study Section: Special Emphasis Panel (ZHG1-HGR-N (O1))
Program Officer: Bonazzi, Vivien
Project Start: 2009-09-25
Project End: 2012-07-31
Budget Start: 2010-08-01
Budget End: 2012-07-31
Support Year: 2
Fiscal Year: 2010
Total Cost: $734,945
Indirect Cost:

Name: Emory University
Department: Biology
Type: Schools of Arts and Sciences
DUNS #: 066469933
City: Atlanta
State: GA
Country: United States
Zip Code: 30322

Børnich, Claus; Grytten, Ivar; Hovig, Eivind et al. (2016) Galaxy Portal: interacting with the galaxy platform through mobile devices. Bioinformatics 32:1743-5
Stoler, Nicholas; Arbeithuber, Barbara; Guiblet, Wilfried et al. (2016) Streamlined analysis of duplex sequencing data with Du Novo. Genome Biol 17:180
Afgan, Enis; Baker, Dannon; van den Beek, Marius et al. (2016) The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update. Nucleic Acids Res 44:W3-W10
Goecks, Jeremy; El-Rayes, Bassel F; Maithel, Shishir K et al. (2015) Open pipelines for integrated tumor genome profiles reveal differences between pancreatic cancer tumors and cell lines. Cancer Med 4:392-403
Harris, Nomi L; Cock, Peter J A; Chapman, Brad A et al. (2015) The Bioinformatics Open Source Conference (BOSC) 2013. Bioinformatics 31:299-300
Blankenberg, Daniel; Taylor, James; Nekrutenko, Anton (2015) Online resources for genomic analysis using high-throughput sequencing. Cold Spring Harb Protoc 2015:324-35
Sauria, Michael Eg; Phillips-Cremins, Jennifer E; Corces, Victor G et al. (2015) HiFive: a tool suite for easy and efficient HiC and 5C data analysis. Genome Biol 16:237
Budd, Aidan; Corpas, Manuel; Brazas, Michelle D et al. (2015) A quick guide for building a successful bioinformatics community. PLoS Comput Biol 11:e1003972
Blankenberg, Daniel; Von Kuster, Gregory; Bouvier, Emil et al. (2014) Dissemination of scientific software with Galaxy ToolShed. Genome Biol 15:403
Blankenberg, Daniel; Johnson, James E; Galaxy Team et al. (2014) Wrangling Galaxy's reference data. Bioinformatics 30:1917-9

Showing the most recent 10 out of 23 publications