Inexact simple repeats, both ungapped and gapped, can be quantified as local sums in a Markov additive process (MAP). The maximum of the local sums has an asymptotic Gumbel distribution, with parameters determined by general MAP formulas. These formulas are usually computationally intractable, but an essential simplification in the case of ungapped repeats makes the computations feasible. Dr. Spouge's analytic results for ungapped repeats are more detailed than the results derived by simulation in G. Achaz et al. (2006), "Repseek, a tool to retrieve approximate repeats from large DNA sequences". We have used the analytic formulas to provide insight and numerical checks while determining the corresponding statistical parameters for gapped repeats. Dr. Sheetlin implemented the publicly available program RepWords for finding gapped tandem repeats and calculating their statistics. In addition, with a single simple stroke, Drs. Spouge, Sheetlin, and Mariño-Ramírez generalized the important linear-time Ruzzo-Tompa algorithm, which finds ungapped subsequences of unusual composition, so that it also finds gapped subsequences of unusual composition. The generalization includes our repeat-finding algorithms as a special case. Finally, in the publicly available program MsDetector, Drs. Sheetlin and Girgis have implemented repeat-finding techniques using hidden Markov models.
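
For readers unfamiliar with the ungapped Ruzzo-Tompa algorithm mentioned above, the following is a minimal Python sketch of the classic procedure for finding all maximal scoring subsequences of a list of real scores. It is not the project's code and does not include the gapped generalization. The function name `ruzzo_tompa` and the example scores are illustrative assumptions. For brevity, the sketch scans the candidate list from the right at each step, so it omits the pointer bookkeeping on which the linear-time bound depends.

```python
def ruzzo_tompa(scores):
    """Return all maximal scoring subsequences of `scores`.

    Each result is (start, end, score) with `end` exclusive.
    Hypothetical helper written for illustration only.
    """
    # Each candidate is [start, end, L, R], where L is the cumulative
    # score before the candidate and R the cumulative score at its end.
    candidates = []
    total = 0
    for i, s in enumerate(scores):
        left = total
        total += s
        if s <= 0:
            continue  # non-positive scores never start a candidate
        start, end, L, R = i, i + 1, left, total
        while True:
            # Find the rightmost earlier candidate j with L_j < L.
            j = len(candidates) - 1
            while j >= 0 and candidates[j][2] >= L:
                j -= 1
            if j < 0 or candidates[j][3] >= R:
                # No merge is possible: keep the candidate as it stands.
                candidates.append([start, end, L, R])
                break
            # Otherwise extend leftwards to absorb candidates j onwards,
            # then reconsider the enlarged candidate.
            start, L = candidates[j][0], candidates[j][2]
            del candidates[j:]
    return [(a, b, R - L) for a, b, L, R in candidates]


if __name__ == "__main__":
    # Illustrative scores, e.g. positive for matches, negative for mismatches.
    print(ruzzo_tompa([4, -5, 3, -3, 1, 2, -2, 2, -2, 1, 5]))
```

In a repeat-finding setting, the scores might be per-position match and mismatch values along a sequence, so that each maximal subsequence marks a candidate ungapped repeat. That scoring scheme is an assumption here, not a detail taken from the project description.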

Support Year: 5
Fiscal Year: 2012
Total Cost: $314,535
Name: National Library of Medicine
Spouge, John L; Mariño-Ramírez, Leonardo; Sheetlin, Sergey L (2014) Searching for repeats, as an example of using the generalised Ruzzo-Tompa algorithm to find optimal subsequences with gaps. Int J Bioinform Res Appl 10:384-408