With the growth of biological information, the efficiency of database retrieval has become central to the biological enterprise. In particular, whenever a retrieval method is changed, one must be able to evaluate whether the change is an improvement. We are developing methods based on the statistical bootstrap to assign statistical significance to improvements in database retrieval. The methods are based on mathematical central limit theorems describing the behavior of the receiver operating characteristic (ROC) curve under bootstrapping. We now have theorems relating the curve to U-statistics, providing a ready mathematical framework for developing the central limit theorems the theory requires. Our methodology has already been applied to determine which changes to the PSI-BLAST program actually constitute improvements. In addition, we are investigating "isotonicity" of relevance in retrieval, the assumption that after rankwise averaging of relevance, records are retrieved on average in decreasing order of relevance. The isotonic assumption affects the evaluation of retrieval efficiency, and preliminary results indicate that despite its widespread adoption, the assumption can be wrong. We are also exploring the possibility of placing metrics on retrieval methods, to determine how closely related two retrieval methods are. The metrics could distinguish, e.g., a "tweak" on an accepted retrieval algorithm (which produces retrieval "close" to the algorithm's) from a truly novel algorithm (which produces a "distant" retrieval).
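As a rough illustration of the bootstrap idea described above, the following Python sketch compares two retrieval methods by the area under the ROC curve (AUC), computed as a normalized Mann-Whitney U-statistic. It is not the project's actual procedure: the function names, the choice to resample individual records of a single ranked list (rather than, say, whole queries), and the percentile interval are all assumptions made for illustration.

```python
import random


def auc_u_statistic(scores, labels):
    """AUC as a normalized Mann-Whitney U-statistic.

    Returns the fraction of (relevant, irrelevant) record pairs in which the
    relevant record has the higher score; ties count as half a pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one relevant and one irrelevant record")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def paired_bootstrap_auc_diff(scores_a, scores_b, labels, n_boot=2000, seed=0):
    """Paired bootstrap of AUC(B) - AUC(A) over the same records.

    Hypothetical helper: resamples records with replacement, keeping each
    record's two scores and its label together so the comparison stays paired.
    Returns (observed difference, 95% percentile interval, fraction of
    resamples in which B was not better than A).
    """
    rng = random.Random(seed)
    n = len(labels)
    observed = auc_u_statistic(scores_b, labels) - auc_u_statistic(scores_a, labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        lab = [labels[i] for i in idx]
        if sum(lab) in (0, n):
            continue  # resample lacks one class, so AUC is undefined; redraw
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        diffs.append(auc_u_statistic(b, lab) - auc_u_statistic(a, lab))
    diffs.sort()
    lo = diffs[int(0.025 * n_boot)]
    hi = diffs[int(0.975 * n_boot) - 1]
    frac_not_better = sum(1 for d in diffs if d <= 0) / n_boot
    return observed, (lo, hi), frac_not_better
```

Under these assumptions, a percentile interval for the difference that excludes zero would suggest that the change from method A to method B is more than resampling noise. The central limit theorems mentioned above are what would justify a normal approximation in place of the percentile interval.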

Support Year: 8
Fiscal Year: 2010
Total Cost: $78,348
Name: National Library of Medicine
Carroll, Hyrum D; Williams, Alex C; Davis, Anthony G et al. (2015) Improving Retrieval Efficacy of Homology Searches Using the False Discovery Rate. IEEE/ACM Trans Comput Biol Bioinform 12:531-7
Carroll, Hyrum D; Kann, Maricel G; Sheetlin, Sergey L et al. (2010) Threshold Average Precision (TAP-k): a measure of retrieval designed for bioinformatics. Bioinformatics 26:1708-13