Due to the rapidly increasing volume of biological data from sequencing, imaging, and other technologies, data processing needs in the life sciences are now on par with those of the physical and engineering disciplines. The distributed nature of data generation in biology makes this situation even more challenging. Today one can hardly find a research institution or university without multiple high-throughput DNA sequencing machines, and there are frequent references to a "data crisis" in biology. Federal agencies, and the NSF in particular, are investing heavily in cyberinfrastructure by supporting the development of high-performance computing (HPC) resources such as the Extreme Science and Engineering Discovery Environment (XSEDE). Yet to a large extent these resources remain unknown to biological researchers, who overwhelmingly continue to rely on fragile in-house computation. The goal of this project is to ensure effective utilization of the federal funds that have been invested in the development of the national computing infrastructure. This project will extend the Galaxy software platform to leverage existing NSF hardware resources, increasing the value of existing infrastructure for biology researchers who were previously unable to take full advantage of these resources.
This project will follow a comprehensive approach that addresses the needs of experimental scientists, tool developers, and administrators of high-performance computing (HPC) systems. (1) Access to national compute infrastructure will be expanded so that Galaxy functions as a middleware interface to existing heterogeneous environments such as XSEDE or individual systems such as Jetstream. Software components necessary to optimize Galaxy as a link between researchers and existing HPC resources will be developed based on pilot projects with the Texas Advanced Computing Center (TACC), XSEDE, PSC, and Indiana University. (2) XSEDE resources for interactive data exploration and visualization will be leveraged to expand Galaxy's current capacity for dynamic scientific data analysis. Integration with Interactive Analysis Environments, such as Jupyter or RStudio, will allow manipulation and creation of Galaxy datasets using common scripting languages. Taking advantage of XSEDE resources will enable Galaxy's interactive environments and visual analytics to scale to large datasets and sophisticated workflows. (3) Sustainable training and outreach will focus on creating and disseminating curricula that enable investigators to learn the skills needed to analyze large datasets. Creating pre-configured infrastructure components for running workshops and developing modules for undergraduate and graduate face-to-face and online classes will expand the current educational portfolio to scale support for increasing numbers of Galaxy users, including those in disciplines beyond the life sciences such as Natural Language Processing. Outcomes of this project will be available at http://galaxyproject.org and https://github.com/galaxyproject.