Alignment of biological sequences is a key step in understanding their evolution, function, and patterns of activity. Here, we describe Machine Learning approaches to improve both accuracy and speed of highly- sensitive sequence alignment. To improve accuracy, we develop methods to reduce erroneous annotation caused by (1) the existence of low complexity and repetitive sequence and (2) the overextension of alignments of true homologs into unrelated sequence. We describe approaches based on both hidden Markov models and Artificial Neural Networks to dramatically reduce these sorts of sequence annotation error. We also address the issue of annotation speed, with development of a custom Deep Learning architecture designed to very quickly filter away large portions of candidate sequence comparisons prior to the relatively-slow sequence-alignment step. The results of these efforts will be incorporated into forks of the open source sequence alignment tools HMMER, MMSeqs, and (where appropriate) BLAST; we will also work with community developers of annotation pipelines, such as RepeatMasker and IMG/M, to incorporate these approaches. The development and incorporation into these widely used bioinformatics tools will lead to widespread impact on sequence annotation efforts.

Public Health Relevance

Modern molecular biology depends on effective methods for creating sequence alignments quickly and accurately. This proposal describes a plan to develop novel Machine Learning approaches that will dramatically increase the speed of highly-sensitive sequence alignment, and will also address two significant sources of erroneous sequence annotation, (i) the presence of repetitive sequence in biological sequences, and (ii) the tendency for sequence alignment algorithms to extend alignments beyond the boundaries of true homology. The proposed methods represent a mix of applications of hidden Markov models and Artificial Neural Networks, and build on prior success in applying such methods to the problem of sensitive sequence annotation.

Agency
National Institute of Health (NIH)
Institute
National Institute of General Medical Sciences (NIGMS)
Type
Research Project (R01)
Project #
5R01GM132600-02
Application #
10020995
Study Section
Genetic Variation and Evolution Study Section (GVE)
Program Officer
Ravichandran, Veerasamy
Project Start
2019-09-20
Project End
2023-07-31
Budget Start
2020-08-01
Budget End
2021-07-31
Support Year
2
Fiscal Year
2020
Total Cost
Indirect Cost
Name
University of Montana
Department
Biostatistics & Other Math Sci
Type
Schools of Arts and Sciences
DUNS #
010379790
City
Missoula
State
MT
Country
United States
Zip Code
59812