Multiple genome projects have generated large volumes of DNA, RNA and protein sequence data. While computational pattern searching techniques such as BLAST (a program for measuring sequence similarities) have enabled major discoveries such as the modularity of proteins, much more information can be gained from biological sequence data. However, due to the length and complexity of the patterns, we are limited by the computational algorithms used for motif discovery. Simple Motif Search (SMS), Planted Motif Search (PMS) and Edit-distance Motif Search (EMS) are the three principal paradigms that have been previously used for identifying short functional peptide motifs, transcriptional regulatory elements, composite regulatory patterns, DNA motifs, similarity between families of proteins, etc. Our group has been instrumental in developing algorithms for these problems. Existing pattern-search algorithms for motif search have two major shortcomings: 1) Approximate algorithms do not always identify the correct pattern, but have the advantage that they can be used to look for short and relatively large patterns in large data sets such as genomes. 2) In contrast, an exact algorithm always identifies the correct pattern, but cannot be used to identify complex data patterns in large datasets. To extract more sophisticated patterns from genomic data we need exact algorithms that can be used to analyze genomes for complex patterns with reasonable computational resources. Exact algorithms are currently limited because the run times of these algorithms are exponentially dependent on the parameters involved. For example, the currently best known algorithms for PMS and EMS (on a PC) are expected to take more than a month for patterns of length 27 and more than 5.67 years for patterns of length 31. In this project, we propose to develop the next generation SMS, PMS and EMS algorithms that identify more complex patterns in larger datasets using less computation time and memory. We also propose to develop a web based system that will incorporate PMS and EMS algorithms. Open source versions of all the algorithms and data developed will be made available to users. In addition, the web system will support online processing of queries that involve the solution of PMS and EMS.

Public Health Relevance

The identification of patterns in biological data is an important problem and due to the length and complexity of the patterns, we are limited by the computational algorithms used for finding these patterns. Simple Motif Search (SMS), Planted Motif Search (PMS) and Edit-distance Motif Search (EMS) are the three principal paradigms that have been previously used for identifying short functional peptide motifs, transcriptional regulatory elements, composite regulatory patterns, DNA motifs, similarity between families of proteins, etc. Existing algorithms for solving SMS, PMS, and EMS have many shortcomings including the exponential dependence of their run times on the parameters involved, and to overcome these shortcomings we propose to develop advanced algorithms that identify more complex patterns in larger datasets using less computation time and memory.

Agency
National Institute of Health (NIH)
Institute
National Library of Medicine (NLM)
Type
Research Project (R01)
Project #
5R01LM010101-03
Application #
8324284
Study Section
Biomedical Library and Informatics Review Committee (BLR)
Program Officer
Ye, Jane
Project Start
2010-09-30
Project End
2014-09-29
Budget Start
2012-09-30
Budget End
2013-09-29
Support Year
3
Fiscal Year
2012
Total Cost
$347,598
Indirect Cost
$77,992
Name
University of Connecticut
Department
Engineering (All Types)
Type
Schools of Engineering
DUNS #
614209054
City
Storrs-Mansfield
State
CT
Country
United States
Zip Code
06269
Kundeti, Vamsi; Rajasekaran, Sanguthevar; Dinh, Hieu (2014) Border length minimization problem on a square array. J Comput Biol 21:446-55
Rajasekaran, Sanguthevar; Nicolae, Marius (2014) An Elegant Algorithm for the Construction of Suffix Arrays. J Discrete Algorithms (Amst) 27:21-28
Saha, Subrata; Rajasekaran, Sanguthevar (2014) Efficient and scalable scaffolding using optical restriction maps. BMC Genomics 15 Suppl 5:S5
Li, Junjie; Ranka, Sanjay; Sahni, Sartaj (2014) Multicore and GPU algorithms for Nussinov RNA folding. BMC Bioinformatics 15 Suppl 8:S1
Nicolae, Marius; Rajasekaran, Sanguthevar (2014) Efficient sequential and parallel algorithms for planted motif search. BMC Bioinformatics 15:34
Mamun, Abdullah-Al; Mi, Tian; Aseltine, Robert et al. (2014) Efficient sequential and parallel algorithms for record linkage. J Am Med Inform Assoc 21:252-62
Saha, Subrata; Rajasekaran, Sanguthevar; Bi, Jinbo et al. (2013) Efficient techniques for genotype-phenotype correlational analysis. BMC Med Inform Decis Mak 13:41
Mi, Tian; Rajasekaran, Sanguthevar (2013) Efficient algorithms for biological stems search. BMC Bioinformatics 14:161
Mi, Tian; Merlin, Jerlin Camilus; Deverasetty, Sandeep et al. (2012) Minimotif Miner 3.0: database expansion and significantly improved reduction of false-positive predictions from consensus sequences. Nucleic Acids Res 40:D252-60
Merlin, Jerlin C; Rajasekaran, Sanguthevar; Mi, Tian et al. (2012) Reducing false-positive prediction of minimotifs with a genetic interaction filter. PLoS One 7:e32630

Showing the most recent 10 out of 14 publications