Bioinformatic data sets are large and complicated. Marshalling and managing necessary resources (e.g., hardware; computer and programmer time) requires significant skill. Effective analysis and comprehension involve sophisticated statistical understanding. Domains of application and available data types change rapidly, requiring flexible and familiar programming environments. Collaborations involve diverse research groups of heterogeneous size and expertise. This project develops and disseminates new and efficient approaches to solving present and emerging problems in statistical analysis and interpretation of very large data. The project combines the strengths of two very widely used and complementary bioinformatics projects, Bioconductor and Galaxy.
The project has three components. The first, providing scalable access, develops R programming paradigms appropriate for scalable analysis. R/Bioconductor software will be developed for efficient reduction of large data to statistical descriptions by iterating data through transformation kernels. Bioconductor will be deployed for use in an accessible cloud-based environment, and will be integrated into the Galaxy deployment scheme. The second component is to provide statistical methods for big genomic data by developing high performance statistical methodologies for analysis of large bioinformatics data. This applies the initial technical achievements to specific requirements of statistical analysis in genomics. Domains of application include: quality assessment and normalization of very large raw data; data reduction and uncertainty measure calculation for downstream interrogation; and discovery, reporting and auditing of novel biological findings. Developments require novel computational approaches that avoid all-data-in-memory computational models (prevalent in current algorithm implementations), and that re-express monolithic algorithms as concurrently executable independent components. This emphasizes extensible and composable elements to yield a richer toolkit for statistical genomics. The aim leverages R's strength as a language for rapid development of statistical methodologies, and emphasizes areas of proven strength in the Bioconductor project. The third component addresses decision making. This aspect provides integration of R / Bioconductor work flows into Galaxy. We will deploy key results from Aim 2 as Galaxy work flows. New real-time feedback for streaming analytics will be introduced to Galaxy, and leveraged by Bioconductor.
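To make the idea of iterating data through transformation kernels concrete, the following is a minimal sketch, written in Python purely for illustration; it is not R/Bioconductor code and does not represent any software produced by the project. It assumes a simple numeric stream and computes a mean and variance without holding all data in memory. Each chunk is reduced independently to a small summary, and summaries are then merged, so the per-chunk step could in principle run concurrently. All names (`Summary`, `chunks`, `summarize`, `merge`, `streaming_summary`) are hypothetical.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class Summary:
    """Sufficient statistics for the mean and variance of the data seen so far."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    @property
    def variance(self) -> float:
        # Sample variance; undefined for fewer than two observations.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


def chunks(values: Iterable[float], size: int) -> Iterator[List[float]]:
    """Yield successive chunks so that only one chunk is held in memory at a time."""
    buf: List[float] = []
    for v in values:
        buf.append(v)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf


def summarize(chunk: List[float]) -> Summary:
    """Kernel: reduce one chunk to its summary, independently of all other chunks."""
    n = len(chunk)
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return Summary(n, mean, m2)


def merge(a: Summary, b: Summary) -> Summary:
    """Combine two summaries using the standard pairwise mean/variance update."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Summary(n, mean, m2)


def streaming_summary(values: Iterable[float], size: int = 1000) -> Summary:
    """Reduce an arbitrarily long stream to a single summary, chunk by chunk."""
    return reduce(merge, map(summarize, chunks(values, size)), Summary())
```

Because `merge` gives the same result regardless of how the data are split into chunks (up to floating-point rounding), the `map` step here could be replaced by a parallel map without changing the outcome. That is the sense in which a monolithic computation is re-expressed as independent components in this sketch.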
The project includes very significant capacity building. The Bioconductor project successfully solicits, tests, and disseminates over 600 R packages for the statistical analysis and comprehension of high-throughput genomic data. All packages include extensive documentation, including vignettes describing intent, function, and interoperability. Packages reflect contributions from a broad scientific community, and enable national and international graduate, post-graduate, and commercial research activities in statistical, bioinformatic, and computational domains. This project furthers the capacity-building impact of Bioconductor by addressing memory and performance limitations on statistical analysis of large and complicated bioinformatic data. Galaxy enables broad access to computational resources for data-intensive biomedical research. This project enhances the capacity-building impact of Galaxy by providing scalable processing of big bioinformatic data, and enabling exploratory analysis by a broad bioinformatic community. The coupling of Bioconductor and Galaxy provides significant synergy, facilitating rapid translation of statistical and bioinformatic research developed in R to broad use through Galaxy.