Multiple genome projects have generated large volumes of DNA, RNA and protein sequence data. While computational pattern searching techniques such as BLAST (a program for measuring sequence similarities) have enabled major discoveries such as the modularity of proteins, much more information can be gained from biological sequence data. However, due to the length and complexity of the patterns, we are limited by the computational algorithms used for motif discovery. Simple Motif Search (SMS), Planted Motif Search (PMS) and Edit-distance Motif Search (EMS) are the three principal paradigms that have been previously used for identifying short functional peptide motifs, transcriptional regulatory elements, composite regulatory patterns, DNA motifs, similarity between families of proteins, etc. Our group has been instrumental in developing algorithms for these problems. Existing pattern-search algorithms for motif search have two major shortcomings: 1) Approximate algorithms do not always identify the correct pattern, but have the advantage that they can be used to look for short and relatively large patterns in large data sets such as genomes. 2) In contrast, an exact algorithm always identifies the correct pattern, but cannot be used to identify complex data patterns in large datasets. To extract more sophisticated patterns from genomic data we need exact algorithms that can be used to analyze genomes for complex patterns with reasonable computational resources. Exact algorithms are currently limited because the run times of these algorithms are exponentially dependent on the parameters involved. For example, the currently best known algorithms for PMS and EMS (on a PC) are expected to take more than a month for patterns of length 27 and more than 5.67 years for patterns of length 31. In this project, we propose to develop the next generation SMS, PMS and EMS algorithms that identify more complex patterns in larger datasets using less computation time and memory. We also propose to develop a web based system that will incorporate PMS and EMS algorithms. Open source versions of all the algorithms and data developed will be made available to users. In addition, the web system will support online processing of queries that involve the solution of PMS and EMS.

Public Health Relevance

The identification of patterns in biological data is an important problem and due to the length and complexity of the patterns, we are limited by the computational algorithms used for finding these patterns. Simple Motif Search (SMS), Planted Motif Search (PMS) and Edit-distance Motif Search (EMS) are the three principal paradigms that have been previously used for identifying short functional peptide motifs, transcriptional regulatory elements, composite regulatory patterns, DNA motifs, similarity between families of proteins, etc. Existing algorithms for solving SMS, PMS, and EMS have many shortcomings including the exponential dependence of their run times on the parameters involved, and to overcome these shortcomings we propose to develop advanced algorithms that identify more complex patterns in larger datasets using less computation time and memory.

Agency
National Institute of Health (NIH)
Institute
National Library of Medicine (NLM)
Type
Research Project (R01)
Project #
5R01LM010101-03
Application #
8324284
Study Section
Biomedical Library and Informatics Review Committee (BLR)
Program Officer
Ye, Jane
Project Start
2010-09-30
Project End
2014-09-29
Budget Start
2012-09-30
Budget End
2013-09-29
Support Year
3
Fiscal Year
2012
Total Cost
$347,598
Indirect Cost
$77,992
Name
University of Connecticut
Department
Engineering (All Types)
Type
Schools of Engineering
DUNS #
614209054
City
Storrs-Mansfield
State
CT
Country
United States
Zip Code
06269
Sharma, Surbhi; Young, Richard J; Chen, Jingchun et al. (2018) Minimotifs dysfunction is pervasive in neurodegenerative disorders. Alzheimers Dement (N Y) 4:414-432
Nicolae, Marius; Rajasekaran, Sanguthevar (2017) On pattern matching with k mismatches and few don't cares. Inf Process Lett 118:78-82
Sharma, Surbhi; Toledo, Oniel; Hedden, Michael et al. (2016) The Functional Human C-Terminome. PLoS One 11:e0152731
Mamun, Abdullah-Al; Pal, Soumitra; Rajasekaran, Sanguthevar (2016) KCMBT: a k-mer Counter based on Multiple Burst Trees. Bioinformatics 32:2783-90
Pal, Soumitra; Xiao, Peng; Rajasekaran, Sanguthevar (2016) Efficient sequential and parallel algorithms for finding edit distance based motifs. BMC Genomics 17 Suppl 4:465
Mamun, Abdullah-Al; Aseltine, Robert; Rajasekaran, Sanguthevar (2016) Efficient Record Linkage Algorithms Using Complete Linkage Clustering. PLoS One 11:e0154446
Saha, Subrata; Rajasekaran, Sanguthevar (2016) NRGC: a novel referential genome compression algorithm. Bioinformatics 32:3405-3412
Hamdalla, Mai A; Ammar, Reda A; Rajasekaran, Sanguthevar (2015) A molecular structure matching approach to efficient identification of endogenous mammalian biochemical structures. BMC Bioinformatics 16 Suppl 5:S11
Saha, Subrata; Rajasekaran, Sanguthevar (2015) ERGC: an efficient referential genome compression algorithm. Bioinformatics 31:3468-75
Nicolae, Marius; Pathak, Sudipta; Rajasekaran, Sanguthevar (2015) LFQC: a lossless compression algorithm for FASTQ files. Bioinformatics 31:3276-81

Showing the most recent 10 out of 46 publications