Useful retrieval depends on the ability to predict which documents a user will find helpful in response to a query. Our interest is in the common case where no information is provided about the user other than the query, and the query is in natural language. In this setting it is well accepted that a human can make useful predictions in the form of judgments about what will likely prove useful to another human. We present data showing that when the predictions of a group of humans are averaged, the result is a better predictor. When performance is measured as precision, group performance increases with the size of the group and, on our data, approaches a limit of approximately 50% improvement over average individual performance. The superior performance of groups raises the question of how it arises. The groups we studied were subject experts, and a natural question was whether their superior performance resulted from the pooling of their subject knowledge. To answer this question we also studied a group of untrained individuals. To our surprise, we found that while untrained individuals performed somewhat worse than trained individuals, the group of untrained individuals together performed better than any single trained individual and almost at the level of the trained group. In recent work we have revisited these data and shown how to improve the predictions of the group by methods better than simple averaging. The improvement is based on the recognition that different members of the group are more or less typical of the average user, so weighting the members to produce a properly weighted average can give a better prediction than a simple average.
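The contrast between simple and weighted averaging of judgments can be sketched as follows. This is an illustrative example only: the judgment matrix is invented, and the weighting scheme shown (scoring each judge by agreement with the leave-one-out consensus of the other judges) is one plausible way to measure how "typical" a member is, not the specific method of the cited papers.

```python
import numpy as np

# Hypothetical judgments: rows = judges, columns = document pairs,
# entries = binary relevance votes (1 = relevant, 0 = not relevant).
judgments = np.array([
    [1, 0, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1],
    [1, 0, 1, 1, 0],
], dtype=float)

# Simple average: every judge counts equally.
simple_avg = judgments.mean(axis=0)

# Weighted average: weight each judge by how typical they are of the
# group, here (as an assumed illustration) measured by agreement with
# the majority vote of the remaining judges.
n_judges = judgments.shape[0]
weights = np.empty(n_judges)
for i in range(n_judges):
    others = np.delete(judgments, i, axis=0).mean(axis=0)
    consensus = (others >= 0.5).astype(float)
    weights[i] = (judgments[i] == consensus).mean()
weights /= weights.sum()  # normalize so the weights sum to 1

weighted_avg = weights @ judgments
```

Judges whose votes track the group consensus receive larger weights, so their judgments dominate the combined prediction, while atypical judges are down-weighted rather than discarded.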
Wilbur, W John; Kim, Won (2011) Improving a gold standard: treating human relevance judgments of MEDLINE document pairs. BMC Bioinformatics 12 Suppl 3:S5

Kim, Won; Wilbur, W John (2011) Improving a gold standard: treating human relevance judgments of MEDLINE document pairs. Proc Int Conf Mach Learn Appl 2010:491-498