The NCBI CoreTools now contains our code, which calculates all parameters of the modified Gumbel distribution (the Gumbel scale parameter λ, the pre-factor k, and the finite-size correction) to practical accuracy in less than 1 second. The BLAST group has used our faster calculations to generate the modified Gumbel parameters for several new DNA scoring schemes and has incorporated our finite-size correction into BLAST. Our code enables a real-time composition correction to BLAST searches, and so solves a fundamental open problem in bioinformatics, one recognized by researchers since the late 1990s. Implemented as a switch in BLAST for compositionally skewed queries, the code could improve retrieval results for medically important queries, such as sequences from the pathogens that cause malaria and tuberculosis, at an arbitrarily small cost in computational speed. BLAST searches have yet to incorporate our code for real-time composition corrections, however. Our collaboration with Dr. Martin Frith has extended our methods to next-generation sequence matching, including frameshifts in DNA, a subject relevant to the NCBI BLAST services. In a practical test of our methods, our frameshift statistics approximately doubled the number of known human pseudogenes. Dr. Frith has incorporated our program FALP into his genomic alignment program, LAST, so that it can handle frameshifts in next-generation sequence matches. We are currently preparing a C++ library so that any local alignment tool written in C++ can incorporate our statistical methods and use arbitrary scoring systems and letter abundances. We have made our results available to the NCBI BLAST group.
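
To illustrate how the Gumbel parameters λ and k are used once they have been estimated, the following minimal Python sketch applies the standard Karlin-Altschul relation E = k·m·n·exp(−λS) and the corresponding P-value 1 − exp(−E). It is not the project's code: the function names and the numeric parameter values are assumptions chosen only for illustration, and the sketch omits the finite-size correction described above.

```python
import math

def evalue(score, m, n, lam, k):
    """Expected number of chance local alignments with score >= `score`.

    m, n : lengths of the two sequences (or of query and database)
    lam  : Gumbel scale parameter (lambda)
    k    : Gumbel pre-factor
    No finite-size correction is applied in this sketch.
    """
    return k * m * n * math.exp(-lam * score)

def pvalue(e):
    """Probability of at least one chance alignment, given E-value `e`."""
    return -math.expm1(-e)  # numerically stable 1 - exp(-e)

if __name__ == "__main__":
    # Illustrative placeholder values only (assumptions, not taken from the project).
    lam, k = 0.267, 0.041
    e = evalue(score=60, m=300, n=5_000_000, lam=lam, k=k)
    print(f"E-value: {e:.3g}, P-value: {pvalue(e):.3g}")
```

Because λ and k depend on the scoring system and the letter abundances, a change in sequence composition calls for re-estimating them, which is why the speed of the parameter calculation matters for a real-time composition correction.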

Support Year
17
Fiscal Year
2015
Name
National Library of Medicine
Gauran, Iris Ivy M; Park, Junyong; Lim, Johan et al. (2018) Empirical null estimation using zero-inflated discrete mixture distributions and its application to protein domain data. Biometrics 74:458-471
Sheetlin, Sergey; Park, Yonil; Frith, Martin C et al. (2016) ALP & FALP: C++ libraries for pairwise local alignment E-values. Bioinformatics 32:304-5
Carroll, Hyrum D; Williams, Alex C; Davis, Anthony G et al. (2015) Improving Retrieval Efficacy of Homology Searches Using the False Discovery Rate. IEEE/ACM Trans Comput Biol Bioinform 12:531-7
Sheetlin, Sergey L; Park, Yonil; Frith, Martin C et al. (2014) Frameshift alignment: statistics and post-genomic applications. Bioinformatics :
Park, Yonil; Sheetlin, Sergey; Ma, Ning et al. (2012) New finite-size correction for local alignment score distributions. BMC Res Notes 5:286
Sheetlin, Sergey; Park, Yonil; Spouge, John L (2011) Objective method for estimating asymptotic parameters, with an application to sequence alignment. Phys Rev E Stat Nonlin Soft Matter Phys 84:031914
Park, Yonil; Sheetlin, Sergey; Spouge, John L (2009) Estimating the Gumbel scale parameter for local alignment of random sequences by importance sampling with stopping times. Ann Stat 37:3697