With the growth of biological information, the efficiency of database retrieval has become central to the biological enterprise. In particular, whenever a retrieval method is changed, one must be able to evaluate whether the change is an improvement. Initially, using U-statistics, we developed central limit theorems describing the behavior of the receiver operating characteristic curve truncated at n false positives (ROCn) under bootstrapping. We applied this methodology to determine which changes to the PSI-BLAST program actually constitute improvements. Eventually, however, we rejected the ROCn as an unacceptable measure of database retrieval efficacy for bioinformatics and substituted the Threshold Average Precision (TAPk) in its place. Because it measures retrieval efficacy for each query (which the ROCn cannot always do), the TAPk permits metrics on retrieval methods that determine how closely two methods resemble each other in their query-by-query behavior. The metrics can distinguish, e.g., a "tweak" on an accepted retrieval algorithm (which produces retrieval "close" to the algorithm's) from a truly novel algorithm (which produces a "distant" retrieval). They thereby reward originality in bioinformatics by objectively displaying incremental improvements of existing algorithms for what they are. Citation databases show that the TAPk is receiving attention in bioinformatics. Presently, Drs. Spouge and Carroll are applying the TAPk to evaluate variations of BLAST retrieval that replace E-values with false discovery rates.
Carroll, Hyrum D; Williams, Alex C; Davis, Anthony G et al. (2015) Improving Retrieval Efficacy of Homology Searches Using the False Discovery Rate. IEEE/ACM Trans Comput Biol Bioinform 12:531-7
Carroll, Hyrum D; Kann, Maricel G; Sheetlin, Sergey L et al. (2010) Threshold Average Precision (TAP-k): a measure of retrieval designed for bioinformatics. Bioinformatics 26:1708-13