This grant proposes the development of an extensible, scalable, automated data analysis pipeline for functional genomics data. Functional genomics, including microarray and proteomics studies, is evolving quickly, with data sets growing rapidly in size and new analysis methodologies appearing monthly. Because there are no de facto standards for addressing typical experimental questions, applying multiple analyses is desirable but is rarely done because of the effort required. Furthermore, the analysis of functional genomics data is generally a multi-step process, with many possible methods in use at each step (e.g., image analysis, data normalization, statistical analysis, data mining), so the effort grows combinatorially when multiple analyses are applied. The functional genomics data pipeline proposed in this application will perform multiple analyses automatically, will be easily extensible with new functions and data types, will run in a distributed computing environment to supply adequate computational power, and will integrate automated annotation so that analyses can be guided by biological knowledge. The system will use Enterprise JavaBeans to provide a robust server architecture, JavaServer Pages for dynamic generation of web interfaces, and object-oriented design patterns to optimize the software architecture. The system will be extensible during operation through use of the Strategy design pattern coupled with the Java reflection mechanism. Functional genomics data sets will be encapsulated within data objects that include links to the NCI caBIO objects in order to use the NCI Center for Bioinformatics data resources. In addition, annotations will be retrievable from web sites and through the Distributed Annotation System. Documentation and testing will proceed in parallel with development, and end users will be involved during design and deployment to tune the user interface.
The final system will provide dramatic improvements in researchers' abilities to fully explore their growing data sets and to interpret their experimental results in light of the larger biological knowledge bases. It will be fully supported and released to the community as open source.
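The run-time extension mechanism named above (the Strategy design pattern coupled with Java reflection) can be illustrated with a minimal sketch. It is not the project's code: the interface `NormalizationStrategy`, the classes `MeanCenter` and `MaxScale`, and the helper `load` are hypothetical names chosen for illustration, and the sketch assumes a current Java SE runtime rather than the Enterprise JavaBeans environment the proposal describes. Each analysis method implements a common interface, and the pipeline obtains an instance from a class name supplied at run time, so a new method can be added without changing the calling code.

```java
import java.util.Arrays;

class StrategyReflectionSketch {

    /** Strategy interface: one interchangeable analysis step (hypothetical). */
    public interface NormalizationStrategy {
        double[] normalize(double[] values);
    }

    /** Example strategy: subtract the mean from every value. */
    public static class MeanCenter implements NormalizationStrategy {
        @Override
        public double[] normalize(double[] values) {
            double mean = Arrays.stream(values).average().orElse(0.0);
            return Arrays.stream(values).map(v -> v - mean).toArray();
        }
    }

    /** Example strategy: divide by the largest absolute value. */
    public static class MaxScale implements NormalizationStrategy {
        @Override
        public double[] normalize(double[] values) {
            double max = Arrays.stream(values).map(Math::abs).max().orElse(1.0);
            double divisor = (max == 0.0) ? 1.0 : max; // avoid division by zero
            return Arrays.stream(values).map(v -> v / divisor).toArray();
        }
    }

    /**
     * Instantiates a strategy from its fully qualified class name using
     * reflection. Failures are rethrown as IllegalArgumentException.
     */
    public static NormalizationStrategy load(String className) {
        try {
            return Class.forName(className)
                    .asSubclass(NormalizationStrategy.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot load strategy: " + className, e);
        }
    }

    public static void main(String[] args) {
        double[] data = {2.0, 4.0, 6.0};
        // Nested classes have binary names of the form Outer$Inner.
        String prefix = StrategyReflectionSketch.class.getName() + "$";
        for (String name : new String[] {"MeanCenter", "MaxScale"}) {
            NormalizationStrategy strategy = load(prefix + name);
            System.out.println(name + " -> " + Arrays.toString(strategy.normalize(data)));
        }
    }
}
```

In this sketch the list of class names is hard-coded in `main`; in a running system it could plausibly come from configuration or user input, which is what would allow methods to be added during operation.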
Ochs, Michael F; Casagrande, John T (2008) Information systems for cancer research. Cancer Invest 26:1060-7
Kossenkov, Andrew V; Peterson, Aidan J; Ochs, Michael F (2007) Determining transcription factor activity from microarray data using Bayesian Markov chain Monte Carlo sampling. Stud Health Technol Inform 129:1250-4
Wang, Guoli; Kossenkov, Andrew V; Ochs, Michael F (2006) LS-NMF: a modified non-negative matrix factorization algorithm utilizing uncertainty estimates. BMC Bioinformatics 7:175
Bidaut, Ghislain; Suhre, Karsten; Claverie, Jean-Michel et al. (2006) Determination of strongly overlapping signaling activity from microarray data. BMC Bioinformatics 7:99