Multiple genome projects have generated large volumes of DNA, RNA and protein sequence data. While computational pattern searching techniques such as BLAST (a program for measuring sequence similarities) have enabled major discoveries such as the modularity of proteins, much more information can be gained from biological sequence data. However, due to the length and complexity of the patterns, we are limited by the computational algorithms used for motif discovery. Simple Motif Search (SMS), Planted Motif Search (PMS) and Edit-distance Motif Search (EMS) are the three principal paradigms that have been previously used for identifying short functional peptide motifs, transcriptional regulatory elements, composite regulatory patterns, DNA motifs, similarity between families of proteins, etc. Our group has been instrumental in developing algorithms for these problems. Existing pattern-search algorithms for motif search have two major shortcomings: 1) Approximate algorithms do not always identify the correct pattern, but have the advantage that they can be used to look for short and relatively large patterns in large data sets such as genomes. 2) In contrast, an exact algorithm always identifies the correct pattern, but cannot be used to identify complex data patterns in large datasets. To extract more sophisticated patterns from genomic data we need exact algorithms that can be used to analyze genomes for complex patterns with reasonable computational resources. Exact algorithms are currently limited because the run times of these algorithms are exponentially dependent on the parameters involved. For example, the currently best known algorithms for PMS and EMS (on a PC) are expected to take more than a month for patterns of length 27 and more than 5.67 years for patterns of length 31. In this project, we propose to develop the next generation SMS, PMS and EMS algorithms that identify more complex patterns in larger datasets using less computation time and memory. We also propose to develop a web based system that will incorporate PMS and EMS algorithms. Open source versions of all the algorithms and data developed will be made available to users. In addition, the web system will support online processing of queries that involve the solution of PMS and EMS.
The identification of patterns in biological data is an important problem and due to the length and complexity of the patterns, we are limited by the computational algorithms used for finding these patterns. Simple Motif Search (SMS), Planted Motif Search (PMS) and Edit-distance Motif Search (EMS) are the three principal paradigms that have been previously used for identifying short functional peptide motifs, transcriptional regulatory elements, composite regulatory patterns, DNA motifs, similarity between families of proteins, etc. Existing algorithms for solving SMS, PMS, and EMS have many shortcomings including the exponential dependence of their run times on the parameters involved, and to overcome these shortcomings we propose to develop advanced algorithms that identify more complex patterns in larger datasets using less computation time and memory.
Sharma, Surbhi; Young, Richard J; Chen, Jingchun et al. (2018) Minimotifs dysfunction is pervasive in neurodegenerative disorders. Alzheimers Dement (N Y) 4:414-432 |
Nicolae, Marius; Rajasekaran, Sanguthevar (2017) On pattern matching with k mismatches and few don't cares. Inf Process Lett 118:78-82 |
Sharma, Surbhi; Toledo, Oniel; Hedden, Michael et al. (2016) The Functional Human C-Terminome. PLoS One 11:e0152731 |
Mamun, Abdullah-Al; Pal, Soumitra; Rajasekaran, Sanguthevar (2016) KCMBT: a k-mer Counter based on Multiple Burst Trees. Bioinformatics 32:2783-90 |
Pal, Soumitra; Xiao, Peng; Rajasekaran, Sanguthevar (2016) Efficient sequential and parallel algorithms for finding edit distance based motifs. BMC Genomics 17 Suppl 4:465 |
Mamun, Abdullah-Al; Aseltine, Robert; Rajasekaran, Sanguthevar (2016) Efficient Record Linkage Algorithms Using Complete Linkage Clustering. PLoS One 11:e0154446 |
Saha, Subrata; Rajasekaran, Sanguthevar (2016) NRGC: a novel referential genome compression algorithm. Bioinformatics 32:3405-3412 |
Hamdalla, Mai A; Ammar, Reda A; Rajasekaran, Sanguthevar (2015) A molecular structure matching approach to efficient identification of endogenous mammalian biochemical structures. BMC Bioinformatics 16 Suppl 5:S11 |
Saha, Subrata; Rajasekaran, Sanguthevar (2015) ERGC: an efficient referential genome compression algorithm. Bioinformatics 31:3468-75 |
Nicolae, Marius; Pathak, Sudipta; Rajasekaran, Sanguthevar (2015) LFQC: a lossless compression algorithm for FASTQ files. Bioinformatics 31:3276-81 |
Showing the most recent 10 out of 46 publications