This material is based upon work supported by the National Science Foundation under Grant No. IIS-1350799.

Project Duration

6/1/2014 - 5/31/2019

Project Goals

The purpose of this project is to improve search systems' ability to help users complete tasks. The usefulness of any search engine ultimately depends on how good it is at aiding its users. The systems and the tasks they are used for can be very complicated; small changes in a system's implementation or a task's execution can have major effects on the usefulness of the system, especially over a long lifespan of use by a large base of people. The traditional approach to understanding utility involves the use of test collections, which consist of a collection of documents to be searched, unchanging information needs, and human judgments of the relevance of documents to needs; these components are put into a simple batch process that measures search effectiveness and tests simple statistical hypotheses. While this approach is useful, it often fails to capture variability present in users and tasks: different users often interact with the same system in very different ways, meaning a system that is useful for one user or one task may not be useful for another user or task. Therefore, this project focuses on developing new methods for understanding, estimating, and improving the usefulness of information retrieval (IR) systems that take variability into consideration.

The methods investigated in this project are designed to model user interactions with a system to complete a task, including how users determine relevance in context, how they modify their interaction with a system over time, and how different approaches by different users affect the overall system usefulness. The project will produce new types of test collections, evaluation measures, and statistical methods for batch-style systems-based information retrieval evaluation for use by researchers and practitioners in academia and industry. The work will demonstrate how to use these both to improve system utility to a population of users and to pose deeper hypotheses about causality in IR system development, thus leading to improvements in IR technology in all domains. Research will be integrated with educational activities through which students, researchers, and practitioners can learn advanced experimental design and analysis. Educational efforts will include tutorials and courses on empirical methods in IR and computer science, methods in use in the wider scientific community, and how the newly developed methods relate to those.

Summary of Research Findings So Far

We organize results along three lines of research: variability in user preferences and behavior; variability in user interactions across sessions; and variability in evaluation using test collections and statistical significance testing.

A finding that spans many of these results is that researchers in academia can often use small, focused query logs to investigate research questions that would otherwise be open only to researchers in industry.

Summary of Educational Efforts So Far

I have presented two tutorials on the theory and practice of statistical significance testing in IR (linked below). Both have been well attended and well received. The topic is important because many researchers treat statistical significance tests as a necessary component for a published [empirical] paper, yet at the same time understand them as little more than a "black box" that takes evaluation data as input and produces a p-value. In these tutorials, I explain how to perform statistical significance tests in various IR research settings, how to interpret their results (showing that a p-value less than the commonly accepted threshold of 0.05 can in fact say very little about the actual usefulness of the system being tested), and how to ask the appropriate questions about the results of statistical significance tests. The tutorial material is currently being edited into a monograph.
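As a minimal illustration of the point about p-values (a sketch on made-up per-topic scores, not material taken from the tutorials), the following Python snippet runs a paired t-test by hand. The function name `paired_t_summary`, the score values, and the number of topics are all hypothetical, and the p-value uses a normal approximation to the t distribution, which is reasonable only because the number of topics is large.

```python
import math
import statistics


def paired_t_summary(scores_a, scores_b):
    """Return (mean difference, t statistic, approx. two-sided p-value, effect size)."""
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation of the differences
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    # Assumption: normal approximation to the t distribution (fine for large n).
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))
    # Standardized effect size (Cohen's d for paired data).
    effect_size = mean_diff / sd_diff
    return mean_diff, t_stat, p_value, effect_size


# Hypothetical data: 10,000 topics, every baseline score equal to 0.5.
n_topics = 10_000
baseline = [0.5] * n_topics
# The variant is alternately about 0.105 better and 0.095 worse than the
# baseline, so its average gain is only 0.005 on a 0-to-1 effectiveness scale.
variant = [0.5 + 0.005 + (0.1 if i % 2 == 0 else -0.1) for i in range(n_topics)]

mean_diff, t_stat, p_value, effect_size = paired_t_summary(baseline, variant)
print(f"mean difference = {mean_diff:.4f}")
print(f"t statistic     = {t_stat:.2f}")
print(f"p-value         = {p_value:.2e}")
print(f"effect size     = {effect_size:.3f}")
```

With this many topics the p-value falls far below 0.05, yet the average gain is 0.005 and the standardized effect size is about 0.05. Whether such a difference matters to users is a separate question that the test itself does not answer.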

At the University of Delaware, I have taught a graduate-level elective course on Empirical Methods for Computer Science. The course covers topics in project selection, data analysis, hypothesis generation, experimental design, statistical analysis, and experimental validity. Students read papers from all areas of science, starting out very generally before focusing on specific questions in computer science as the semester progresses; they write responses to the papers and discuss them in class. Students also complete a project that takes a research problem from proposal to pilot study to the design of a solid experiment to answer an outstanding research question. (Unfortunately, I cannot make very many materials available, since most of the papers we read are not published under Open Access models, student responses are private, and student discussion is not recorded. However, the course syllabus and a bibliography of papers covered are linked below.)



We are collaborating with University College London, the University of Amsterdam, and Microsoft on coordination of the TREC Tasks track.



Educational Resources


Ben Carterette
Associate Professor
Department of Computer and Information Sciences
University of Delaware
Newark, DE USA

Last updated: June 9, 2017


Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.