Alignment of biological sequences is a key step in understanding their evolution, function, and patterns of activity. Here, we describe Machine Learning approaches to improve both accuracy and speed of highly- sensitive sequence alignment. To improve accuracy, we develop methods to reduce erroneous annotation caused by (1) the existence of low complexity and repetitive sequence and (2) the overextension of alignments of true homologs into unrelated sequence. We describe approaches based on both hidden Markov models and Artificial Neural Networks to dramatically reduce these sorts of sequence annotation error. We also address the issue of annotation speed, with development of a custom Deep Learning architecture designed to very quickly filter away large portions of candidate sequence comparisons prior to the relatively-slow sequence-alignment step. The results of these efforts will be incorporated into forks of the open source sequence alignment tools HMMER, MMSeqs, and (where appropriate) BLAST; we will also work with community developers of annotation pipelines, such as RepeatMasker and IMG/M, to incorporate these approaches. The development and incorporation into these widely used bioinformatics tools will lead to widespread impact on sequence annotation efforts.
Modern molecular biology depends on effective methods for creating sequence alignments quickly and accurately. This proposal describes a plan to develop novel Machine Learning approaches that will dramatically increase the speed of highly-sensitive sequence alignment, and will also address two significant sources of erroneous sequence annotation, (i) the presence of repetitive sequence in biological sequences, and (ii) the tendency for sequence alignment algorithms to extend alignments beyond the boundaries of true homology. The proposed methods represent a mix of applications of hidden Markov models and Artificial Neural Networks, and build on prior success in applying such methods to the problem of sensitive sequence annotation.