Inexact simple repeats, both ungapped and gapped, can be quantified as local sums in a Markov additive process (MAP). The maximum of the local sums has an asymptotic Gumbel distribution, with parameters determined by general MAP formulas. These formulas are usually computationally intractable, but an essential simplification in the case of ungapped repeats makes the computations feasible. Dr. Spouge's analytic results for ungapped repeats are more detailed than the results derived by simulation in G. Achaz et al. (2006), "Repseek, a tool to retrieve approximate repeats from large DNA sequences". We have used the analytic formulas to provide insight and numerical checks while determining the corresponding statistical parameters for gapped repeats. Dr. Sheetlin implemented the publicly available program RepWords for finding gapped tandem repeats and calculating their statistics. In addition, in a single simple step, Drs. Spouge, Sheetlin, and Mariño-Ramírez generalized the important linear-time Ruzzo-Tompa algorithm, which finds ungapped subsequences of unusual composition, so that it also finds gapped subsequences of unusual composition. The generalization includes our repeat-finding algorithms as a special case. Drs. Park, Sheetlin, and Spouge are presently developing programs for public distribution that calculate threshold scores for statistically significant repeats and find repeats in biological sequences. The resulting programs will account for the skewed composition of the genomes of important pathogens such as the malaria parasite, where repeats aid the pathogen in evading the immune system. In addition, Drs. Sheetlin and Girgis have implemented repeat-finding techniques based on hidden Markov models in the publicly available program MsDetector.
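For orientation, the classical ungapped Ruzzo-Tompa procedure mentioned above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code of RepWords or of the generalized gapped algorithm. The function names `maximal_scoring_segments` and `lag_scores`, and the match/mismatch scores of +1 and -2, are assumptions made for the example. For brevity the sketch rescans the list of candidate segments from the right, so it omits the extra bookkeeping that the original algorithm uses to guarantee linear time.

```python
def maximal_scoring_segments(scores):
    """Return all maximal scoring segments as (start, end, score), end exclusive.

    Follows the Ruzzo-Tompa scheme: keep a list of disjoint candidate
    segments, each with L (cumulative score before its first element)
    and R (cumulative score through its last element).
    """
    segments = []  # entries are [start, end, L, R]
    total = 0
    for i, s in enumerate(scores):
        prev = total
        total += s
        if s <= 0:
            continue  # non-positive scores never start a segment
        cand = [i, i + 1, prev, total]
        while True:
            # Rightmost earlier segment j with L_j < L_cand.
            j = len(segments) - 1
            while j >= 0 and segments[j][2] >= cand[2]:
                j -= 1
            if j < 0 or segments[j][3] >= cand[3]:
                segments.append(cand)  # candidate stands on its own
                break
            # Otherwise extend the candidate left to absorb segments j..end.
            cand = [segments[j][0], cand[1], segments[j][2], cand[3]]
            del segments[j:]
    return [(a, b, r - l) for a, b, l, r in segments]


def lag_scores(seq, lag, match=1, mismatch=-2):
    """Hypothetical scoring: compare each letter with the letter `lag` later."""
    return [match if seq[i] == seq[i + lag] else mismatch
            for i in range(len(seq) - lag)]


example = [4, -5, 3, -3, 1, 2, -2, 2, -2, 1, 5]
print(maximal_scoring_segments(example))
print(maximal_scoring_segments(lag_scores("ACGTACGTAC", 4)))
```

Under the assumed scoring, a high-scoring segment of `lag_scores(seq, lag)` marks a stretch of the sequence that approximately repeats itself at distance `lag`. Whether a given score is statistically significant is the question the Gumbel parameters address, and the sketch does not attempt to compute them.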
Spouge, John L; Mariño-Ramírez, Leonardo; Sheetlin, Sergey L (2014) Searching for repeats, as an example of using the generalised Ruzzo-Tompa algorithm to find optimal subsequences with gaps. Int J Bioinform Res Appl 10:384-408