We propose to capitalize on the success of the ongoing collaboration between the bioinformatics teams at the University of Greifswald (Germany) and the Georgia Institute of Technology (USA) to address open challenges in computational genome annotation. In the course of this work, we plan to implement new algorithmic ideas and to meet the need for unbiased integration of different types of OMICS data. We plan to address one of the long-standing problems at the interface of bioinformatics and machine learning: automatic generative and discriminative parameterization of gene finding algorithms. Current methods of combining OMICS evidence frequently yield tools that under-predict or over-predict. Drawing on a good understanding of the difficulties and properties of the different types of OMICS evidence, we propose an optimized approach to fully unsupervised, generative and discriminative training. We will introduce novel means of optimizing the integration of multiple types of OMICS evidence into gene prediction. These ideas will further develop the protein family-based gene finding implemented in AUGUSTUS-PPX. We propose to create representations of protein families for gene finding that, for the first time, include cross-species gene structure information. We will develop a new approach that unifies two advanced research areas: transcript reconstruction from RNA-Seq, and statistical gene finding that integrates RNA-Seq and homology information. We will describe a new, comprehensive model and an EM-like algorithmic technique (the "wholistic" approach) to identify the sets of transcripts and their expression levels that best fit the available OMICS evidence. We will also develop an automatic gene-finding algorithm for the full content of metagenomes, including eukaryotic and viral metagenomic sequences. This task is conventionally considered too challenging.
We propose a solution that exploits and advances the algorithmic ideas and approaches we mastered while creating gene finders for prokaryotic metagenomes as well as eukaryotic genomes. All new tools will be made available to the community under open-source licenses.
The goal of this project is to advance the science of genome interpretation by developing much-needed computational methods and tools for high-precision annotation of eukaryotic genomes and metagenomes. This advance will have an impact on research on model and non-model organisms, including important human pathogens, parasites and viruses. New high-throughput technologies generate large volumes of sequence data on complex genomes as well as metagenomes. These data volumes still have to be transformed into scientific knowledge. Our new bioinformatics tools, matching the latest sequencing technology in speed and performance, will make a significant impact on genomic research aimed at the ultimate understanding of human health and disease.