6/1/2014 - 5/31/2019
The purpose of this project is to improve search systems' ability to help users complete tasks. The usefulness of any search engine ultimately depends on how good it is at aiding its users. The systems and the tasks they are used for can be very complicated; small changes in a system's implementation or a task's execution can have major effects on the usefulness of the system, especially over a long lifespan of use by a large base of people. The traditional approach to understanding utility involves the use of test collections, which consist of a collection of documents to be searched, unchanging information needs, and human judgments of the relevance of documents to needs; these components are put into a simple batch process that measures search effectiveness and tests simple statistical hypotheses. While this approach is useful, it often fails to capture variability present in users and tasks: different users often interact with the same system in very different ways, meaning a system that is useful for one user or one task may not be useful for another user or task. Therefore, this project focuses on developing new methods for understanding, estimating, and improving the usefulness of information retrieval (IR) systems that take variability into consideration.
The methods investigated in this project are designed to model user interactions with a system to complete a task, including how users determine relevance in context, how they modify their interaction with a system over time, and how different approaches by different users affect the overall system usefulness. The project will produce new types of test collections, evaluation measures, and statistical methods for batch-style systems-based information retrieval evaluation for use by researchers and practitioners in academia and industry. The work will demonstrate how to use these both to improve system utility to a population of users and to pose deeper hypotheses about causality in IR system development, thus leading to improvements in IR technology in all domains. Research will be integrated with educational activities through which students, researchers, and practitioners can learn advanced experimental design and analysis. Educational efforts will include tutorials and courses on empirical methods in IR and computer science, methods in use in the wider scientific community, and how the newly developed methods relate to those.
Summary of Research Findings So Far
We organize results along three lines of research: variability in user preferences and behavior; variability in user interactions across sessions; and variability in evaluation using test collections and statistical significance testing.
A finding that "umbrellas" many of those below is that it is often possible for researchers in academia to use small, focused query logs in order to investigate research questions that would otherwise only be open to researchers in industry.
- We show that standard text-based similarity measures such as cosine similarity do not correlate well with user notions of document similarity. However, multiple text-based measures along with other features of documents can be used to train a classifier for similarity that has classification accuracy close to the level of human disagreement (Zengin & Carterette, CIKM 2015).
- Using the idea of conditional preferences that we introduced in earlier work, we show that users tend to prefer to see documents that cover many different aspects of a topic over documents that present novel information (with less coverage) or even documents that are more relevant (Bah, Chandar & Carterette, SIGIR 2015).
- We demonstrate that taking advantage of the historical "popularity" of documents across different user sessions for similar information needs, we can improve effectiveness of ranking for a new user with that information need (Bah & Carterette, AIRS 2015).
- We develop a model of user session abandonment that makes use of features computed from the query, documents in the most recent results obtained, and documents obtained earlier in the session. We show that using features computed from documents retrieved earlier in the session in particular makes a significant difference in predicting abandonment (Zengin & Carterette, AIRS 2016).
- Using many of the same features as described for our AIRS paper above, we show that using features computed from the history of the session can significantly improve predictions about user clicks for the current query in the session (Zengin & Carterette, CHIIR 2017).
- We develop simulation models of user interactions (query formulation, clicks, dwell times, and session abandonment) over sessions of searches to satisfy a particular information need. These models can generate a simulated query log for evaluating search systems that make use of such data (Carterette, Bah & Zengin, ICTIR 2015).
- We present a data fusion method to combine results from multiple searches on the same information need--either from one user's search history on that information need, or from other users' searches on that information need--to improve search results for the next query a user enters in the session (Bah & Carterette, AIRS 2014; Bah & Carterette, Tapia 2015).
- Using "alternative possible queries"--queries that a user could have submitted rather than the one they did--extracted from a user's search history in a session combined with the data fusion method described above, we improve search results for the next query a user enters in the session (Bah & Carterette, WI 2016).
- We develop a formal theoretical model of the data fusion techniques described above. This model can accept queries from a variety of sources, including the "alternative possible queries" described in the preceding item, and can incorporate different user models of browsing results and generating query reformulations, providing a general framework for investigating improvement of search results over sessions (Bah & Carterette, ICTIR 2016).
- We present another query reformulation simulation model using text extraction tools to identify key terms and phrases from search results that would make reasonable user queries in a session (Bah & Carterette, DEXA 2016).
- We introduce a Bayesian evaluation model for IR meant to augment the type of analysis that statistical significance testing is usually used for. Our Bayesian framework models relevance of ranked documents directly, rather than modeling a summary evaluation measure, and as a result, standard evaluation measures like precision and nDCG emerge from it naturally (Carterette, ICTIR 2015).
- We use extreme value theory to argue that the "best published result" does not necessarily come from the best retrieval system, because any published result is necessarily measured over a small sample of queries. In fact, the "best published result" could look as much as 20% better than its "real" effectiveness. This has implications for the use of baselines and comparisons in evaluation (Carterette, SIGIR 2015).
- We present a simulation model to evaluate the "potential for generalizability" of published results. Based on results in tables in published papers, we can simulate the likelihood that a similar result would be observed in a different experimental setting. Our simulations suggest that up to 50% of published results would not generalize (Carterette & Sabhnani, SIGIR 2015 RIGOR workshop).
- Through the TREC Tasks track, we have produced a test collection consisting of tasks that users wish to complete, subtasks that they need to complete in order to complete the overall task, and judgments of both the relevance and the usefulness of web pages as well as of short keyphrases automatically generated by IR systems (Yilmaz et al., TREC 2015; Verma et al., TREC 2016; Kanoulas et al., TREC 2017).
- We describe and compare four years of test collections released through the TREC Session Track, which ran from 2011 to 2014. We show how participating groups used the test collections to research search in sessions, and elucidate comparisons of results across differences in the collections (Carterette et al., SIGIR 2016).
- We analyze statistical test p-values from published papers in IR from 1995 through 2014 to understand the distribution of effect sizes being detected in IR research. Using a simple simulation methodology, we extrapolate to future effect sizes to show that, far from being "solved", major new results in IR are still waiting to be discovered (Carterette, SIGIR 2017).
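The extreme value argument in the list above (the "best published result" item) can be illustrated with a small simulation. This is only a sketch and not the method of the SIGIR 2015 paper: it assumes that every system has exactly the same true effectiveness and that per-query scores follow a Beta distribution. The function name, the Beta parameters, and the default sample sizes are all hypothetical choices made for illustration.

```python
import random
import statistics


def simulate_best_observed_mean(n_systems=50, n_queries=50,
                                alpha=3.0, beta=7.0,
                                n_trials=100, seed=0):
    """Return the average, over trials, of the best observed mean score.

    Assumption (illustrative only): all systems share one true mean,
    alpha / (alpha + beta), so any gap between the best observed mean
    and that value is produced purely by sampling a finite set of queries.
    """
    rng = random.Random(seed)
    best_means = []
    for _ in range(n_trials):
        observed_means = []
        for _ in range(n_systems):
            # Per-query effectiveness scores in [0, 1] for one system.
            scores = [rng.betavariate(alpha, beta) for _ in range(n_queries)]
            observed_means.append(statistics.fmean(scores))
        # The "best published result" is the maximum over all systems.
        best_means.append(max(observed_means))
    return statistics.fmean(best_means)
```

Comparing the returned value with the true mean `alpha / (alpha + beta)` shows how much the maximum over many systems overstates effectiveness, and calling the function with a larger `n_systems` or a smaller `n_queries` shows how the overstatement grows.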
Summary of Educational Efforts So Far
I have presented two tutorials on the theory and practice of statistical significance testing in IR (linked below). Both have been well-attended and well-received. The topic is important because many researchers treat statistical significance tests as a necessary component for a published [empirical] paper, yet at the same time understand them as little more than a "black box" that takes evaluation data as input and produces a p-value. Through this tutorial, I explain how to perform statistical significance tests in various IR research settings, how to interpret their results--showing that a p-value less than the commonly-accepted threshold of 0.05 can in fact have very little meaning about the actual usefulness of the system being tested--and how to ask the appropriate questions about the results of statistical significance tests. This tutorial is currently being edited into a monograph.
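To illustrate the point about p-values and usefulness, here is a minimal sketch of a paired randomization (sign-flip) test on per-query scores. It is not material taken from the tutorials. The function name and parameters are hypothetical, and the test is only one of several that could be applied in this setting.

```python
import random
import statistics


def paired_randomization_test(scores_a, scores_b,
                              n_permutations=10000, seed=0):
    """Two-sided sign-flip randomization test for paired per-query scores.

    Returns (mean_difference, p_value). Under the null hypothesis the
    system labels are exchangeable within each query, so the sign of each
    per-query difference can be flipped at random.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.fmean(diffs)
    observed = abs(mean_diff)
    extreme = 0
    for _ in range(n_permutations):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.fmean(flipped)) >= observed:
            extreme += 1
    # Counting the observed labeling itself keeps the p-value above zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return mean_diff, p_value
```

With many queries, a consistent but very small per-query improvement can produce a p-value far below 0.05 even though the mean difference is too small to matter to users, which is why the mean difference is returned alongside the p-value.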
At the University of Delaware, I have taught a graduate-level elective course on Empirical Methods for Computer Science. The course covers topics in project selection, data analysis, hypothesis generation, experimental design, statistical analysis, and experimental validity. Students read papers from all areas of science, starting out very generally at the beginning before focusing more on specific questions in computer science as the semester progresses, write responses, and discuss in class. Students also complete a project that takes a research problem from proposal to pilot study to the design of a solid experiment to answer an outstanding research question. (Unfortunately, I cannot make very many materials available, since most of the papers we read are not published under Open Access models, student responses are private, and student discussion is not recorded. However, the course syllabus and a bibliography of papers covered are linked below.)
- Ben Carterette, PI
- Karankumar Sabhnani (2014 - present)
- Mustafa Zengin (2015 - present)
- Yuqi Kong (2016 - present)
- Ashraf Bah (2014 - 2016)
- Li Ren (2014 - 2015)
We are collaborating with University College London, the University of Amsterdam, and Microsoft on coordination of the TREC Tasks track.
- Mustafa Zengin, Ben Carterette. User Click Detection in Ideal Sessions. In Proceedings of CHIIR 2017.
- Mustafa Zengin, Ben Carterette. Reformulate or Quit: Predicting User Abandonment in Ideal Sessions. In Proceedings of AIRS 2016.
- Ashraf Bah, Ben Carterette. Generating Pseudo Search History Data in the Absence of Real Search History. In Proceedings of DEXA 2016.
- Ashraf Bah, Ben Carterette. PDF: A Probabilistic Data Fusion Framework for Retrieval and Ranking. In Proceedings of ICTIR 2016.
- Ben Carterette, Paul D. Clough, Mark Hall, Evangelos Kanoulas, Mark Sanderson. Evaluating Retrieval Over Sessions: The TREC Session Track 2011--2014. In Proceedings of SIGIR 2016.
- Manisha Verma, Emine Yilmaz, Rishabh Mehrotra, Evangelos Kanoulas, Ben Carterette, Nick Craswell, Peter Bailey. Overview of the TREC 2016 Tasks Track. In Proceedings of TREC 2016.
- Ashraf Bah, Ben Carterette. Fusing Search Results from Possible Alternative Queries. In Proceedings of WI 2016.
- Emine Yilmaz, Evangelos Kanoulas, Manisha Verma, Ben Carterette, Nick Craswell, Rishabh Mehrotra. Overview of the TREC 2015 Tasks Track. In Proceedings of TREC 2015.
- Mustafa Zengin, Ben Carterette. Learning User Preferences for Topically Similar Documents. In Proceedings of CIKM 2015.
- Ashraf Bah, Ben Carterette. Improving Ranking and Robustness of Search Systems by Exploiting the Popularity of Documents. In Proceedings of AIRS 2015.
- Ben Carterette, Ashraf Bah, Mustafa Zengin. Dynamic Test Collections for Retrieval Evaluation. In Proceedings of ICTIR 2015.
- Ben Carterette. Bayesian Inference for Information Retrieval Evaluation. In Proceedings of ICTIR 2015. Best Paper Award winner.
- Ashraf Bah, Praveen Chandar, Ben Carterette. Document Comprehensiveness and User Preferences in Novelty Search Tasks. In Proceedings of SIGIR 2015.
- Ben Carterette. The Best Published Result is Random: Sequential Testing and its Effect on Reported Effectiveness. In Proceedings of SIGIR 2015.
- Ben Carterette, Karankumar Sabhnani. Using Simulation to Analyze the Potential for Reproducibility. In Proceedings of the SIGIR 2015 Workshop on Reproducibility, Inexplicability, and Generalizability of Results (RIGOR).
- Ashraf Bah, Ben Carterette. Rank Aggregation for Web Search Over Sessions. Richard Tapia Conference, 2015.
- Ashraf Bah, Ben Carterette. Aggregating Results from Multiple Related Queries to Improve Web Search Over Sessions. In Proceedings of AIRS 2014.
- Ashraf Bah, Ben Carterette. Using 'Model' Pseudo-Documents to Improve Searching-as-Learning and Search Over Sessions. In Proceedings of the IIiX 2014 Workshop on Searching as Learning.
Department of Computer and Information Sciences
University of Delaware
Newark, DE USA
Last updated: June 9, 2017
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
This material is based upon work supported by the National Science Foundation under Grant No.