This award is funded under the American Recovery and Reinvestment Act of 2009 (Public Law 111-5).
Support from the NSF MRI-R2 program allowed the University of Maryland at Baltimore to build the Data Intensive Academic Grid (DIAG), which includes 100 nodes for high-throughput computational analysis and 5 nodes for high-performance computational analysis. This resource will optimize data sets generated by mining data from public repositories such as Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis (CAMERA) and the National Center for Biotechnology Information, and will leverage technologies developed for existing grid resources such as the TeraGrid and Open Science Grid. The bioinformatics community will access the DIAG using Ergatis, a web-based pipeline creation and management tool; bioinformatics-oriented virtual machines; and interactive and programmatic access through technologies such as Nimbus and the Virtual Data Toolkit from the Open Science Grid. The software produced at the DIAG site will be easily utilized by projects such as the Globus Workspaces Project, Open Science Grid, and other projects involving large multi-institutional collaborations. The DIAG will also provide spillover capacity for two other science grids, located in Illinois and California. The DIAG development will enable training of the next generation of biologists by bringing a powerful new analysis system to undergraduate, graduate, and post-graduate students in 15 different classroom settings, and by allocating system time to universities with predominantly under-represented minorities, greatly enhancing the available computation and computing infrastructure for these groups.
The Data Intensive Academic Grid (DIAG) was created as a shared computational resource to be used by a set of twenty-two users from over 15 institutions worldwide to conduct bioinformatics analyses, and to serve as a platform for education and training. When the DIAG was conceived, there was increasing awareness of and interest in the use of public or private computational clouds, such as Amazon EC2, for conducting bioinformatics analyses. A secondary goal of DIAG was therefore to understand the complexities of operating and conducting bioinformatics analyses in a cloud, and to ascertain whether computational clouds could become a viable alternative to traditional computational grids or dedicated servers for such analyses. We configured a prototype instrument in 2010 to test the architecture and understand the behavior of the instrument before acquiring the full instrument in spring 2011. The instrument was initially rolled out in the fall of 2011 to a limited number of test users, with a complete rollout in December 2011. The DIAG instrument included 125 high-throughput nodes, five large-memory, high-performance nodes, and over 600 terabytes of central high-performance shared storage. The production instrument currently supports all four of the proposed access methods: the cloud APIs (Nimbus CloudClient and Amazon EC2), Ergatis, direct login access via SSH, and use as an Open Science Grid (OSG) compute element. As of December 2013, DIAG had 646 registered users from 354 organizations around the world. Of the registered users, 371 used the system at least once. Over the past 21 months of the instrument's availability, DIAG was used to complete 7.8 million core hours of computational analysis. The cloud portion accounted for 5.7 million core hours and the shell/Ergatis portions accounted for 2.1 million core hours.
The OSG compute element was deployed only recently and therefore has not seen significant use. This overall usage represents an average capacity utilization of 37%. Based on the number of users per access method, the data suggest that direct shell or command line access is still the most common access method, used by over 306 users, followed by 113 users using virtual machines and cloud APIs in the cloud portion of DIAG. From a capacity utilization perspective, the cloud portion of DIAG had an average utilization of 57%, compared to 19% overall utilization for command line access, suggesting that the cloud portion is used for longer periods of time by a small number of users. DIAG has been used to analyze data that has resulted in five publications to date [1-5], and according to our survey results it is expected to result in another 30-40 publications. DIAG has also been used at a number of institutions as an educational tool, with over 200 students having been trained on it. In addition, it has been used to generate preliminary data for a number of grant submissions. We conducted two user surveys, in December 2012 and December 2013, to get feedback from the user community. Overall, the user community reported that it is very satisfied with the resource (more than 80% of respondents indicated that they were very satisfied or satisfied), and a number of users use it on a frequent basis (daily or weekly). The surveys did identify a number of areas for improvement, which we addressed over the course of the grant.