A High-memory Supercomputer for Proteomics, Text Mining and Microbiome Research

Knight, Robin

Abstract

We request funds to purchase an integrated supercomputer to unite five highly productive and collaborative laboratories with complementary expertise in the microbiome, proteomics, text mining, and supercomputing, and to extend these capabilities to the broader NIH-funded biomedical research community via cloud and web applications. The critical shared need, not met by other systems on campus, unavailable in commercial clouds, and oversubscribed at national labs, is for a system that can run jobs that require high memory (8-32GB/core) and long duration (>2 weeks wall-time), and that is optimized for high-IO tasks that saturate network or storage on other systems. The system will consist of 128 servers, each using 2x8-core 2.93GHz Intel Sandy Bridge CPUs. 20 large-memory nodes will each have 512GB of RAM (32GB/core), and 100 compute nodes will each have 128GB of RAM (8GB/core). These 120 nodes will each use two 10Gbps Ethernet ports bonded together for a 20Gbps/node (2.5GB/s) connection to the rest of the system, and each node will have 2.4TB of raw high-performance local storage. The total aggregate performance of these local disks is over 36GB/s sustained (>300MB/s per node). The remaining 8 nodes will be used for administration, support for advanced software tools and infrastructure, and user interaction. A central high-performance Lustre parallel file system will provide 1.15PB of usable scratch space and sustain 36GB/s to the 128 clients. An archival system of 4 drives/300 tapes will sustain >1GB/s aggregate (accounting for compression), provide 450TB of raw capacity, store ~4.5PB of user data, and scale to 5x this size. The system, valued at $4.5 million but quoted at $2 million by HP due to the strategic importance of this partnership, will be housed in a state-of-the-art machine room in the new Jennie Smoly Caruthers Biotechnology Building on the Boulder campus (opening Feb 2012), and will connect to the rest of the campus at 40Gbps.
The system will be a key enabling technology for scientific areas where data growth is exponential and current systems on campus are end-of-life, solely dedicated to other purposes, or optimized for other tasks. The major users will use the instrument largely for time-consuming one-time tasks such as parameter optimization for microbiome and genome assembly workflows, building knowledgebases, and performing simulations and database searches. These tasks will produce resources that are re-used by much broader user communities (hundreds of collaborators; thousands of end users) who lack supercomputing access. One innovative aspect of this proposal is the configuration of part of the system as an academic cloud, which will allow us to pilot workflows that can later be deployed by diverse users on commercial clouds (e.g. Amazon EC2) and academic clouds (e.g. Magellan and DIAG) once those clouds are upgraded. The system will also build a broad expertise base in high-performance computing in the life sciences through outreach to promising new faculty and trainees on NIH training grants, and through collaborations with new users of the Sequencing Core. The proposed instrument will thus have a profound impact on NIH-funded research.
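
The node counts, memory ratios, and bandwidth figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The following is a minimal Python sketch and is not part of the funded proposal; the constant and function names are illustrative assumptions, and decimal units (1GB = 1000MB, 8 bits per byte) are assumed.

```python
# Figures copied from the abstract; all names here are illustrative.
CORES_PER_NODE = 2 * 8          # two 8-core CPUs per server
LARGE_MEM_NODES = 20            # nodes with 512GB of RAM
LARGE_MEM_GB = 512
COMPUTE_NODES = 100             # nodes with 128GB of RAM
COMPUTE_GB = 128
SUPPORT_NODES = 8               # administration and user interaction
BONDED_PORTS = 2                # 10Gbps Ethernet ports per node
PORT_GBPS = 10
LOCAL_DISK_AGGREGATE_GBS = 36   # sustained GB/s across local disks


def check_specs():
    """Recompute the derived figures stated in the abstract."""
    worker_nodes = LARGE_MEM_NODES + COMPUTE_NODES
    link_gbps = BONDED_PORTS * PORT_GBPS
    return {
        "total_nodes": worker_nodes + SUPPORT_NODES,
        "large_mem_gb_per_core": LARGE_MEM_GB / CORES_PER_NODE,
        "compute_gb_per_core": COMPUTE_GB / CORES_PER_NODE,
        "link_gbps": link_gbps,
        # Assumes 8 bits per byte and no protocol overhead.
        "link_gb_per_s": link_gbps / 8,
        # Assumes decimal units: 1GB = 1000MB.
        "local_disk_mb_per_s_per_node": LOCAL_DISK_AGGREGATE_GBS * 1000 / worker_nodes,
    }


if __name__ == "__main__":
    for name, value in check_specs().items():
        print(f"{name}: {value}")
```

Under these assumptions the recomputed values agree with the abstract: 128 servers in total, 32GB/core and 8GB/core for the two node types, 2.5GB/s per bonded link, and 300MB/s of local disk throughput per node.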

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: Office of The Director, National Institutes of Health (OD)
Type: Biomedical Research Support Shared Instrumentation Grants (S10)
Project #: 1S10OD012300-01
Application #: 8334437
Study Section: Special Emphasis Panel (ZRG1-BST-F (30))
Program Officer: Levy, Abraham

Project Start: 2013-04-22
Project End: 2014-04-21
Budget Start: 2013-04-22
Budget End: 2014-04-21
Support Year: 1
Fiscal Year: 2013
Total Cost: $1,900,000
Indirect Cost

Institution

Name: University of Colorado at Boulder
Department: Chemistry
Type: Schools of Arts and Sciences
DUNS #: 007431505

City: Boulder
State: CO
Country: United States
Zip Code: 80309

Publications

Tripodi, Ignacio J; Allen, Mary A; Dowell, Robin D (2018) Detecting Differential Transcription Factor Activity from ATAC-Seq Data. Molecules 23:

Fulbright, Scott P; Robbins-Pianka, Adam; Berg-Lyons, Donna et al. (2018) Bacterial community changes in an industrial algae production system. Algal Res 31:147-156

Azofeifa, Joseph G; Allen, Mary A; Hendrix, Josephina R et al. (2018) Enhancer RNA profiling predicts transcription factor activity. Genome Res :

Azofeifa, Joseph G; Dowell, Robin D (2017) A generative model for the behavior of RNA polymerase. Bioinformatics 33:227-234

Scott, Amber L; Richmond, Phillip A; Dowell, Robin D et al. (2017) The Influence of Polyploidy on the Evolution of Yeast Grown in a Sub-Optimal Carbon Source. Mol Biol Evol 34:2690-2703

Lladser, Manuel E; Azofeifa, Joseph G; Allen, Mary A et al. (2017) RNA Pol II transcription model and interpretation of GRO-seq data. J Math Biol 74:77-97

Stefferson, Michael W; Norris, Samantha L; Vernerey, Franck J et al. (2017) Effects of soft interactions and bound mobility on diffusion in crowded environments: a model of sticky and slippery obstacles. Phys Biol 14:045008

Azofeifa, Joseph G; Allen, Mary A; Lladser, Manuel E et al. (2017) An Annotation Agnostic Algorithm for Detecting Nascent RNA Transcripts in GRO-Seq. IEEE/ACM Trans Comput Biol Bioinform 14:1070-1081

Blackwell, Robert; Edelmaier, Christopher; Sweezy-Schindler, Oliver et al. (2017) Physical determinants of bipolar mitotic spindle assembly and stability in fission yeast. Sci Adv 3:e1601603

Dowell, Robin; Odell, Aaron; Richmond, Phillip et al. (2016) Genome characterization of the selected long- and short-sleep mouse lines. Mamm Genome 27:574-586

Showing the most recent 10 out of 14 publications
