Efficient Algorithms for Motif Search

Rajasekaran, Sanguthevar

Abstract

Multiple genome projects have generated large volumes of DNA, RNA and protein sequence data. While computational pattern searching techniques such as BLAST (a program for measuring sequence similarities) have enabled major discoveries such as the modularity of proteins, much more information can be gained from biological sequence data. However, due to the length and complexity of the patterns, we are limited by the computational algorithms used for motif discovery. Simple Motif Search (SMS), Planted Motif Search (PMS) and Edit-distance Motif Search (EMS) are the three principal paradigms that have been previously used for identifying short functional peptide motifs, transcriptional regulatory elements, composite regulatory patterns, DNA motifs, similarity between families of proteins, etc. Our group has been instrumental in developing algorithms for these problems. Existing pattern-search algorithms for motif search have two major shortcomings: 1) Approximate algorithms do not always identify the correct pattern, but have the advantage that they can be used to look for short and relatively large patterns in large data sets such as genomes. 2) In contrast, an exact algorithm always identifies the correct pattern, but cannot be used to identify complex data patterns in large datasets. To extract more sophisticated patterns from genomic data we need exact algorithms that can be used to analyze genomes for complex patterns with reasonable computational resources. Exact algorithms are currently limited because the run times of these algorithms are exponentially dependent on the parameters involved. For example, the currently best known algorithms for PMS and EMS (on a PC) are expected to take more than a month for patterns of length 27 and more than 5.67 years for patterns of length 31. In this project, we propose to develop the next generation SMS, PMS and EMS algorithms that identify more complex patterns in larger datasets using less computation time and memory. We also propose to develop a web based system that will incorporate PMS and EMS algorithms. Open source versions of all the algorithms and data developed will be made available to users. In addition, the web system will support online processing of queries that involve the solution of PMS and EMS.

Public Health Relevance

The identification of patterns in biological data is an important problem and due to the length and complexity of the patterns, we are limited by the computational algorithms used for finding these patterns. Simple Motif Search (SMS), Planted Motif Search (PMS) and Edit-distance Motif Search (EMS) are the three principal paradigms that have been previously used for identifying short functional peptide motifs, transcriptional regulatory elements, composite regulatory patterns, DNA motifs, similarity between families of proteins, etc. Existing algorithms for solving SMS, PMS, and EMS have many shortcomings including the exponential dependence of their run times on the parameters involved, and to overcome these shortcomings we propose to develop advanced algorithms that identify more complex patterns in larger datasets using less computation time and memory.

Funding Agency

Agency: National Institute of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Research Project (R01)
Project #: 5R01LM010101-02
Application #: 8142235
Study Section: Biomedical Library and Informatics Review Committee (BLR)
Program Officer: Ye, Jane

Project Start: 2010-09-30
Project End: 2014-09-29
Budget Start: 2011-09-30
Budget End: 2012-09-29
Support Year: 2
Fiscal Year: 2011
Total Cost: $355,196
Indirect Cost

Institution

Name: University of Connecticut
Department: Engineering (All Types)
Type: Schools of Engineering
DUNS #: 614209054

City: Storrs-Mansfield
State: CT
Country: United States
Zip Code: 06269

Related projects


NIH 2013 R01 LM	Efficient Algorithms for Motif Search Rajasekaran, Sanguthevar / University of Connecticut	$319,313
NIH 2012 R01 LM	Efficient Algorithms for Motif Search Rajasekaran, Sanguthevar / University of Connecticut	$347,598
NIH 2011 R01 LM	Efficient Algorithms for Motif Search Rajasekaran, Sanguthevar / University of Connecticut	$355,196
NIH 2010 R01 LM	Efficient Algorithms for Motif Search Rajasekaran, Sanguthevar / University of Connecticut	$397,043

Publications

Sharma, Surbhi; Young, Richard J; Chen, Jingchun et al. (2018) Minimotifs dysfunction is pervasive in neurodegenerative disorders. Alzheimers Dement (N Y) 4:414-432

Nicolae, Marius; Rajasekaran, Sanguthevar (2017) On pattern matching with k mismatches and few don't cares. Inf Process Lett 118:78-82

Sharma, Surbhi; Toledo, Oniel; Hedden, Michael et al. (2016) The Functional Human C-Terminome. PLoS One 11:e0152731

Mamun, Abdullah-Al; Pal, Soumitra; Rajasekaran, Sanguthevar (2016) KCMBT: a k-mer Counter based on Multiple Burst Trees. Bioinformatics 32:2783-90

Pal, Soumitra; Xiao, Peng; Rajasekaran, Sanguthevar (2016) Efficient sequential and parallel algorithms for finding edit distance based motifs. BMC Genomics 17 Suppl 4:465

Mamun, Abdullah-Al; Aseltine, Robert; Rajasekaran, Sanguthevar (2016) Efficient Record Linkage Algorithms Using Complete Linkage Clustering. PLoS One 11:e0154446

Saha, Subrata; Rajasekaran, Sanguthevar (2016) NRGC: a novel referential genome compression algorithm. Bioinformatics 32:3405-3412

Mamun, Abdullah-Al; Aseltine, Robert; Rajasekaran, Sanguthevar (2015) RLT-S: A Web System for Record Linkage. PLoS One 10:e0124449

Nicolae, Marius; Rajasekaran, Sanguthevar (2015) qPMS9: an efficient algorithm for quorum Planted Motif Search. Sci Rep 5:7813

Saha, Subrata; Rajasekaran, Sanguthevar (2015) EC: an efficient error correction algorithm for short reads. BMC Bioinformatics 16 Suppl 17:S2

Showing the most recent 10 out of 46 publications

Comments

Be the first to comment on Sanguthevar Rajasekaran's grant

Recent in Grantomics:

Recently viewed grants:

Recently added grants: