We will form a multidisciplinary team of Indiana University computer scientists, biologists, and bioinformaticians to develop and deploy new large-scale computing infrastructure and tools that will enable fundamental health research. Our research will investigate the impact of Cloud computing architectures on large-scale computational biology, particularly widely encountered "data parallel" problems, including but not limited to DNA sequence analysis. GO funds will be used to establish the new field of Cloud-based computational life science.
Cloud computing is currently typified by Amazon Web Services, Microsoft Azure, and other commercial efforts. However, many universities (including Indiana University) are in the process of establishing research Cloud deployments that will address two general problems:
Infrastructure: Clouds provide simple Web service programming interfaces that allow scientists to create computing clusters and use highly reliable data storage. That is, Clouds provide a way to outsource computing infrastructure.
Runtimes: Cloud systems are particularly well suited to large-scale information retrieval problems. These data-parallel problems involve pipelines of replicated, sequential commands that process very large data sets divided into many pieces. Example technologies include Microsoft Dryad and Apache Hadoop.
In this proposal, we will partner with Microsoft Research, which is currently converting Dryad from a research project to a robust tool. We have analyzed a wide variety of health research problems and have shown that they can benefit from Cloud infrastructure and runtimes. Clouds provide research groups with a way to outsource computing, storage, and networking and to achieve high performance on data-parallel problems in health research.
Our team's research efforts (many NIH funded) represent a wide range of applications, including a) sequence-based transcriptome profiling, b) genome re-sequencing for mutation mapping, c) metagenomics analysis, d) genome annotation, e) comparative genomics, f) population genomics, and g) advanced parallel data mining in patient health records. Processing large-scale data is the common problem uniting these efforts.

Public Health Relevance

We propose to investigate and develop a unique Cloud computing research infrastructure that will have a very large impact on several different life science research areas. Our focus is on the large-scale, data-parallel analysis problems that result from the deluge of data from short-read gene sequencing devices and other sources. We will develop and demonstrate our infrastructure in collaboration with several existing biological and biomedical projects.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
High Impact Research and Research Infrastructure Programs (RC2)
Project #
1RC2HG005806-01
Application #
7852166
Study Section
Special Emphasis Panel (ZRG1-GGG-A (99))
Program Officer
Bonazzi, Vivien
Project Start
2009-09-30
Project End
2011-08-31
Budget Start
2009-09-30
Budget End
2010-08-31
Support Year
1
Fiscal Year
2009
Total Cost
$735,042
Indirect Cost
Name
Indiana University Bloomington
Department
Miscellaneous
Type
Other Domestic Higher Education
DUNS #
006046700
City
Bloomington
State
IN
Country
United States
Zip Code
47401
Hawlena, Hadas; Rynkiewicz, Evelyn; Toh, Evelyn et al. (2013) The arthropod, but not the vertebrate host or its environment, dictates bacterial community composition of fleas and ticks. ISME J 7:221-3
Kuehn, Joanna S; Gorden, Patrick J; Munro, Daniel et al. (2013) Bacterial community profiling of milk samples as a means to understand culture-negative bovine clinical mastitis. PLoS One 8:e61959
Hughes, Adam; Ruan, Yang; Ekanayake, Saliya et al. (2012) Interpolative multidimensional scaling techniques for the identification of clusters in very large sequence sets. BMC Bioinformatics 13 Suppl 2:S9
Wolfe, Alan J; Toh, Evelyn; Shibata, Noriko et al. (2012) Evidence of uncultivated bacteria in the adult female bladder. J Clin Microbiol 50:1376-83
Nelson, David E; Dong, Qunfeng; Van der Pol, Barbara et al. (2012) Bacterial communities of the coronal sulcus and distal urethra of adolescent males. PLoS One 7:e36298
Revanna, Kashi V; Munro, Daniel; Gao, Alvin et al. (2012) A web-based multi-genome synteny viewer for customized data. BMC Bioinformatics 13:190
Dong, Qunfeng; Nelson, David E; Toh, Evelyn et al. (2011) The microbial communities in male first catch urine are highly similar to those in paired urethral swab specimens. PLoS One 6:e19709
Revanna, Kashi V; Chiu, Chi-Chen; Bierschank, Ezekiel et al. (2011) GSV: a web-based genome synteny viewer for customized data. BMC Bioinformatics 12:316
Qiu, Judy; Ekanayake, Jaliya; Gunarathne, Thilina et al. (2010) Hybrid cloud and cluster computing paradigms for life science applications. BMC Bioinformatics 11 Suppl 12:S3