Metagenomic sequencing projects generate thousands to millions of uncharacterized microbial genes that are almost entirely ignored across fields of research. Addressing this problem will fundamentally transform how scientists studying microbial communities or new microbial isolates interpret their genetic material and its function. This potentially high payoff is balanced by a high risk, in that microbial community information has not previously been mined to address this issue. In the absence of more extensive preliminary data, or of one or more years of prior validation, the project must apply previously untried approaches to prioritize and characterize the targeted microbial genes. Lastly, while the downstream methods for gene function prediction will be adapted from eukaryotic model systems, doing so will require both application in a completely new area (culture-independent prokaryotes) and the intersection of multiple disciplines (computational gene function prediction, data integration, and network mining with microbial community studies and microbiology).
Current technologies generate novel nucleotide sequence information at a rate that greatly outpaces our capability to functionally characterize those sequences. From one third to, more typically, over three quarters of the proteins in newly sequenced prokaryotic genomes and communities cannot be functionally characterized. The growth of metagenomic sequencing has produced millions of recently identified, completely uncharacterized microbial genes, creating a significant need for efficient computational gene prioritization and characterization systems. This project will first leverage metagenomic sequences in a novel effort to prioritize uncharacterized genes for further study, breaking from current approaches that target genes from well-studied gene families. Second, integrative, network-based approaches will be used to accelerate and automate the assignment of putative functions to high-priority gene targets for subsequent validation. Both new approaches will be implemented as freely available, documented software and distributed to the broader research community along with pilot datasets. A postdoctoral fellow, a graduate student, and undergraduate students will receive cutting-edge training in integrative experimental and computational approaches during the two-year project.