Advances in DNA sequencing technology have decreased the cost and increased the speed of genome sequencing, leading to exponential growth in the availability of sequenced genomes. To take full advantage of this growth, researchers need software tools that are easy to use yet powerful. Unfortunately, providers of open-source software have struggled to keep pace with the immense volumes of biological data generated daily. As a result, many researchers use tools for tasks they were not designed to perform, which yields suboptimal results. Suitable open-source software tools are needed to process these quantities of genome data and to extract the knowledge needed for scientific breakthroughs. The intellectual merit of the proposed project is that it will fill one significant void: the need for a software program that can cluster millions of protein sequences from sequenced genomes quickly and accurately using cloud computing. From these protein clusters, a phylogenomics approach can be used to predict the function of uncharacterized proteins. Several versions of the software tool pClust have already been developed, and the results obtained are far superior to those generated by any other means. For this work, a version of pClust will be developed for use in the cloud, with support for incremental clustering. In addition, pClust will be used to cluster proteins from all available whole-genome sequences for the entire bacterial phylum Proteobacteria (approximately 1770 species at present). This will be the most comprehensive study of the phylum Proteobacteria to date, at a scale that has not yet been achieved.
The broader impacts of this project will span several areas. When made available via cloud computing with a user-friendly graphical user interface, the parallel configuration of pClust will allow scientists worldwide to analyze thousands of genomes at once, quickly and accurately clustering all the proteins within those genomes. Deployment of this software tool will have a significant impact on discoveries in science and medicine. In terms of education, both graduate and undergraduate students will be trained. Emphasis will be placed on recruiting and engaging female students in this interdisciplinary project, and the project team will participate in annual outreach programs for middle and high school students. Key research findings originating from this project will be published in peer-reviewed journals and presented at conferences. In addition, the software tools developed as part of this project will be released as open source on Google Code, and the results of the clustering, including the protein clusters themselves, phylogenetic trees, phylogenetic profiles, and percentage agreement among the phylogenetic profiles, will be made freely available on the WSU School of Electrical Engineering and Computer Science's Bioinformatics and Computational Biology Web site.