Genome sequencing projects have revealed frequent gains and losses of genes between species. These changes have been shown to be responsible for morphological, physiological, and behavioral differences, and to contribute to the diversity observed in nature. Advances in sequencing technology are making new genome data available at faster rates than ever before. As the number of species with sequenced genomes grows, so will the number of researchers wanting to take advantage of these valuable resources. These researchers will come from a wide range of biological fields and will have an equally wide range of experience with computational tools. CAFE (Computational Analysis of gene Family Evolution) is a software package that allows researchers to better understand rates of gene gain and loss. This project will result in a version of CAFE that adds to the national infrastructure by enabling new biological discoveries that benefit scientists working in many fields. CAFE will be a useful tool in science education, and will also improve and accelerate biological research that can be expected to have multiple societal benefits, including understanding the genetic basis for important biological phenotypes. A vigorous outreach and information dissemination plan will ensure that researchers and faculty engaged in research education are aware of CAFE and able to use it effectively, and will promote the development of a technology-savvy 21st-century biology research community.
Studies of gene families are essential to a number of research areas, including gene regulation, human disease, and evolutionary genomics. CAFE enables these and other studies by providing a likelihood method for analyzing gene gain and loss over a phylogeny. This method has been shown to work well with the error-prone genome assemblies currently available for most organisms, as well as when analyzing dozens of genomes at a time. This project will extend these capabilities to hundreds or thousands of genomes. To accomplish this goal, several of the maximum likelihood methodologies implemented by CAFE will be redesigned. These changes will include allowing rate variation among gene families, optimizing likelihood calculations on trees, and improving the specification of several probability distributions used by these calculations. The quality of the code will be enhanced through best practices in software engineering and the development of faster, more scalable versions of the software for supercomputers. All software will be available at www.indiana.edu/~hahnlab/.