Machine learning approaches for improved accuracy and speed in sequence annotation

Wheeler, Travis

Abstract

Alignment of biological sequences is a key step in understanding their evolution, function, and patterns of activity. Here, we describe Machine Learning approaches to improve both the accuracy and the speed of highly sensitive sequence alignment. To improve accuracy, we develop methods to reduce erroneous annotation caused by (1) the existence of low-complexity and repetitive sequence and (2) the overextension of alignments of true homologs into unrelated sequence. We describe approaches based on both hidden Markov models and Artificial Neural Networks to dramatically reduce these sorts of sequence annotation error. We also address annotation speed by developing a custom Deep Learning architecture designed to very quickly filter away large portions of candidate sequence comparisons prior to the relatively slow sequence-alignment step. The results of these efforts will be incorporated into forks of the open source sequence alignment tools HMMER, MMSeqs, and (where appropriate) BLAST; we will also work with community developers of annotation pipelines, such as RepeatMasker and IMG/M, to incorporate these approaches. Developing these methods and incorporating them into these widely used bioinformatics tools will have a broad impact on sequence annotation efforts.

Public Health Relevance

Modern molecular biology depends on effective methods for creating sequence alignments quickly and accurately. This proposal describes a plan to develop novel Machine Learning approaches that will dramatically increase the speed of highly sensitive sequence alignment and will also address two significant sources of erroneous sequence annotation: (i) the presence of repetitive sequence in biological sequences, and (ii) the tendency of sequence alignment algorithms to extend alignments beyond the boundaries of true homology. The proposed methods combine hidden Markov models and Artificial Neural Networks, and build on prior success in applying such methods to the problem of sensitive sequence annotation.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Research Project (R01)
Project #: 1R01GM132600-01A1
Application #: 9887588
Study Section: Genetic Variation and Evolution Study Section (GVE)
Program Officer: Ravichandran, Veerasamy

Project Start: 2019-09-20
Project End: 2023-07-31
Budget Start: 2019-09-20
Budget End: 2020-07-31
Support Year: 1
Fiscal Year: 2019
Total Cost
Indirect Cost

Institution

Name: University of Montana
Department: Biostatistics & Other Math Sci
Type: Schools of Arts and Sciences
DUNS #: 010379790

City: Missoula
State: MT
Country: United States
Zip Code: 59812

Related projects


NIH 2020 R01 GM	Machine learning approaches for improved accuracy and speed in sequence annotation Wheeler, Travis John / University of Montana
NIH 2019 R01 GM	Machine learning approaches for improved accuracy and speed in sequence annotation Wheeler, Travis John / University of Montana
