Search corpora keep growing: over the last 10 years, the IR research community has moved from the several hundred thousand documents on the TREC disks, to the tens of millions of U.S. government web pages in GOV2, to the one billion general-interest web pages in the new ClueWeb09 collection. But traditional means of acquiring relevance judgments and evaluating systems (e.g. pooling documents to calculate average precision) do not seem to scale well to these new large collections. They require substantially more human assessment effort for the same reliability in evaluation; if that additional cost exceeds the assessing budget, errors in evaluation are inevitable.
The goal of this tutorial was to provide attendees with a comprehensive overview of techniques for low-cost evaluation, where cost is measured in judgment effort. The slides cover a number of topics, including alternatives to pooling, evaluation measures robust to incomplete judgments, evaluating with no relevance judgments, statistical inference of evaluation metrics, inference of relevance judgments, query selection, and techniques to test the reliability of the evaluation and the reusability of the constructed collections.
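To make the problem concrete, here is a minimal sketch of average precision computed from a set of relevance judgments, following the common convention of treating unjudged documents as non-relevant. The function name `average_precision`, the document ids, and the judgments are hypothetical and chosen only for illustration; this is not code from the tutorial materials.

```python
def average_precision(ranking, judgments):
    """Average precision of a ranked list.

    ranking:   list of document ids, best first.
    judgments: dict mapping document id -> True/False (relevant or not).
    Documents missing from `judgments` are treated as non-relevant
    (an assumption; it mirrors the usual handling of unjudged documents).
    """
    total_relevant = sum(1 for rel in judgments.values() if rel)
    if total_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if judgments.get(doc_id, False):
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / total_relevant


# Hypothetical toy data: one ranking, judged completely and then partially.
ranking = ["d1", "d2", "d3", "d4", "d5"]
full_judgments = {"d1": True, "d2": False, "d3": True, "d4": False, "d5": True}
partial_judgments = {"d1": True, "d2": False, "d3": True}  # d4 and d5 unjudged

ap_full = average_precision(ranking, full_judgments)
ap_partial = average_precision(ranking, partial_judgments)
print(f"AP with complete judgments:   {ap_full:.3f}")
print(f"AP with incomplete judgments: {ap_partial:.3f}")
```

In this toy case the two values differ only because of which documents were judged, which is the kind of evaluation error that grows as the judged fraction of a collection shrinks.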
|Presented by:||Ben Carterette (University of Delaware, US)|
|Evangelos Kanoulas (University of Sheffield, UK)|
|Emine Yilmaz (Microsoft Research, Cambridge, UK)|
|Tutorial materials:||Complete booklet of presentation slides|
|Bibliography on evaluation (with an emphasis on low-cost evaluation)|
|Of interest:||TREC Million Query Track, a testbed for low-cost evaluation methods|