The BLAST programs (BLASTP, PSI-BLAST, etc.) presently use offline computer simulations to give accurate estimates of statistical significance for sequence matches. This project has already sped up those offline simulations by a factor of 100 to 1000. Its eventual aim is to speed them up further, so that they can be run online over the web. If the project is successful, BLAST users will be free to use any scores and penalties they choose for matching sequences. Two parameters govern sequence matching statistics: the scale parameter lambda and the Gumbel pre-factor k. We heuristically derived a new equation for the scale parameter lambda; it estimates lambda efficiently and with high accuracy. In addition, we have proposed several new formulas for the Gumbel pre-factor k, based on the path reversal identity and the Poisson clumping heuristic. These formulas also give very accurate results. We have also explored edge effects on the statistics. Edge effects arise because real sequences have finite lengths, and they appear as a correction term in an asymptotic expansion of the probability of a sequence match. Edge effects are likely to be more important in the statistics of matching with gaps than in the statistics of matching without gaps, because gapped matches tend to be longer and so exhaust the sequences being matched more easily. We now have a working prototype program that calculates lambda and k in 1 second for a wide range of alignment parameters. At least two related publications are planned.
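To illustrate the roles of lambda and k, the sketch below uses the classical Karlin-Altschul theory for ungapped local alignment. It is not the project's new equation for gapped alignment, which the abstract does not give. In that classical setting, lambda is the positive root of sum_i p_i exp(lambda s_i) = 1, where p_i is the probability of score s_i, and a match with score S between sequences of lengths m and n has expected count E = k m n exp(-lambda S) and P-value 1 - exp(-E). The function names, the +1/-1 scoring scheme, the sequence lengths, and the value of k are assumptions chosen for illustration only.

```python
import math


def ungapped_lambda(probs, scores, tol=1e-12):
    """Positive root of sum_i p_i * exp(lambda * s_i) = 1 (ungapped case)."""
    mean_score = sum(p * s for p, s in zip(probs, scores))
    if mean_score >= 0 or max(scores) <= 0:
        raise ValueError("need a negative mean score and a positive score")

    def f(lam):
        return sum(p * math.exp(lam * s) for p, s in zip(probs, scores)) - 1.0

    # f(0) = 0, f dips below zero, then grows without bound.
    # Move the bracket right until f(hi) > 0, so the root lies in (lo, hi].
    lo, hi = 0.0, 1.0
    while f(hi) <= 0:
        lo, hi = hi, 2.0 * hi

    # Plain bisection on the bracket.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def evalue(score, m, n, lam, k):
    """Expected number of matches with at least this score: k*m*n*exp(-lam*S)."""
    return k * m * n * math.exp(-lam * score)


def pvalue(score, m, n, lam, k):
    """Gumbel tail probability 1 - exp(-E), computed stably for small E."""
    return -math.expm1(-evalue(score, m, n, lam, k))


# Illustrative scoring: match +1 with probability 0.25, mismatch -1 otherwise.
lam = ungapped_lambda([0.25, 0.75], [1, -1])
# k = 0.3 is a placeholder value, not an estimate from the project.
e = evalue(20, 1000, 1000, lam, 0.3)
p = pvalue(20, 1000, 1000, lam, 0.3)
```

For this scoring scheme the equation reduces to 0.25 x + 0.75 / x = 1 with x = exp(lambda), so lambda = ln 3, which gives a quick check on the bisection. With gaps, no such closed equation is known, which is why simulation or heuristic formulas like those described above are needed.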

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM000088-09
Application #: 7594467
Support Year: 9
Fiscal Year: 2007
Total Cost: $185,366
Name: National Library of Medicine
Country: United States

Spouge, John L (2007) Inequalities on the Overshoot beyond a Boundary for Independent Summands with Differing Distributions. Stat Probab Lett 77:1486-1489
Sheetlin, Sergey; Park, Yonil; Spouge, John L (2005) The Gumbel pre-factor k for gapped local alignment can be estimated from simulations of global alignment. Nucleic Acids Res 33:4987-94
Frith, Martin C; Spouge, John L; Hansen, Ulla et al. (2002) Statistical significance of clusters of motifs represented by position specific scoring matrices in nucleotide sequences. Nucleic Acids Res 30:3214-24
Park, Yonil; Spouge, John L (2002) The correlation error and finite-size correction in an ungapped sequence alignment. Bioinformatics 18:1236-42
Makalowska, I; Ferlanti, E S; Baxevanis, A D et al. (1999) Histone Sequence Database: sequences, structures, post-translational modifications and genetic loci. Nucleic Acids Res 27:323-4
Wolfsberg, T G; Makalowska, I; Makalowski, W (1999) Genomes and evolution. Web alert. Curr Opin Genet Dev 9:619