We will form a multidisciplinary team of Indiana University computer scientists, biologists, and bioinformaticians to develop and deploy new large-scale computing infrastructure and tools that will enable fundamental health research. Our research will investigate the impact of Cloud computing architectures on large-scale computational biology, particularly widely encountered "data parallel" problems, including but not limited to DNA sequence analysis. GO funds will be used to establish the new field of Cloud-based computational life science. Cloud computing is currently typified by Amazon Web Services, Microsoft Azure, and other commercial efforts. However, many universities (including Indiana University) are in the process of establishing research Cloud deployments that will address two general problems. Infrastructure: Clouds provide simple Web service programming interfaces that allow scientists to create computing clusters and use highly reliable data storage. That is, Clouds provide a way to outsource computing infrastructure. Runtimes: Cloud systems are particularly well suited to large-scale information retrieval problems. These data-parallel problems involve pipelines of replicated, sequential commands that process very large data sets divided into many pieces. Example technologies include Microsoft Dryad and Apache Hadoop. In this proposal, we will partner with Microsoft Research, which is currently converting Dryad from a research project to a robust tool. We have analyzed a wide variety of health research problems and have shown that they can benefit from Cloud infrastructure and runtimes. Clouds provide research groups with a way to outsource computing, storage, and networking and to achieve high performance on data-parallel problems in health research.
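The data-parallel pattern described above (a large data set divided into pieces, the same sequential command applied to each piece, and the partial results combined) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the toy reads, and the use of a local thread pool in place of a Cloud runtime such as Dryad or Hadoop. It is not the proposal's implementation.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(reads, chunk_size):
    """Divide the data set into independent pieces (hypothetical helper)."""
    return [reads[i:i + chunk_size] for i in range(0, len(reads), chunk_size)]


def count_bases(chunk):
    """The replicated, sequential command: count bases in one piece."""
    counts = Counter()
    for read in chunk:
        counts.update(read)  # counts each character of the read
    return counts


def parallel_base_counts(reads, chunk_size=2, workers=4):
    """Apply count_bases to every chunk concurrently, then merge the results."""
    chunks = split_into_chunks(reads, chunk_size)
    # A local thread pool stands in for a Cloud runtime here (assumption).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_bases, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds the per-chunk counts together
    return total


# Toy short reads, invented for illustration only.
reads = ["ACGT", "GGCA", "TTAG", "CCGA", "ATAT"]
totals = parallel_base_counts(reads)
print(dict(totals))
```

Because each chunk is processed independently, the same structure could in principle be handed to a distributed runtime by replacing the thread pool; only the splitting and merging steps need to know about the whole data set.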
Our team's research efforts (many of them NIH funded) span a wide range of applications, including a) sequence-based transcriptome profiling, b) genome re-sequencing for mutation mapping, c) metagenomics analysis, d) genome annotation, e) comparative genomics, f) population genomics, and g) advanced parallel data mining in patient health records. Processing large-scale data is the common problem uniting these efforts.
We propose to investigate and develop a unique Cloud computing research infrastructure that will have a very large impact on several different life science research areas. Our focus is on the large-scale, data-parallel analysis problems that result from the deluge of data from short-read gene sequencing devices and other sources. We will develop and demonstrate our infrastructure in collaboration with several existing biological and biomedical projects.