See the track's web site at http://ir.cis.udel.edu/sessions for up-to-date information.
Research in Information Retrieval has traditionally focused on serving the best results for a query, for varying definitions of "best": for ad hoc tasks, the most relevant results; for novelty and diversity tasks, the results that do the best job of covering a space of information needs; for known-item tasks, the result the user is looking for; for filtering tasks, the latest and most relevant results. But users often begin an interaction with a search engine with a sufficiently ill-specified query that they will need to reformulate their query and/or information need several times before they find either the thing or every thing they are looking for. A search engine may be able to better serve a user not by ranking the most relevant results to each query in the sequence, but by ranking results that help "point the way" to what the user is really looking for, or by complementing results from previous queries in the sequence with new results, or in other currently-unanticipated ways.
We call a sequence of reformulations in service of satisfying an
information need a session, and the goal of this track is: (G1)
to test whether systems can improve their performance for a given
query by using a previous query, and (G2) to evaluate system
performance over an entire query session instead of a single query.
Task
For this first year, we will limit the focus of this track
to sessions of two queries, and further limit the focus to
particular types of sessions (defined
below).
A set of 150 query pairs (original
query, query reformulation) will be provided to participants
by NIST. For each pair, participants will
submit three ranked lists of
documents, one for each of three experimental conditions (RL1, RL2, and RL3).
The collection is available from Carnegie Mellon University,
distributed on four hard disks that will be shipped to you (you get to
keep the disks). The entire collection can be obtained for US$790
while the Category B set can be obtained for US$240 (to cover the
costs of the drives and preparing the data) plus shipping.
Full details on how to acquire the ClueWeb collection are available here.
ex: "low carb high fat diet" -> "types of diets"
ex: "us map" -> "us map states and capitals"
ex: "music man performances" -> "music man script"
We have re-used the 2009
Web track queries. The topics in this collection each have a
"main theme" and a series of "aspects". We used the aspects and
main themes of these topics in a variety of combinations
to provide a simulation of an initial and second query. A more complete description of the process is given in a 2-page paper that we will present at the SIGIR 2010 Workshop on Simulation of Interaction in Geneva on July 23rd.
Queries will be released in a colon-delimited text file; each line will consist of a query number, a keyword query, and a reformulation, without information about the reformulation type. For example:
Queries
For the first year, we will limit the
focus of this track to sessions of two queries, and further limit the
focus to particular types of sessions (defined below). This is partly
for pragmatic reasons regarding the difficulty of obtaining session
data, and partly for reasons of experimental design and analysis:
allowing longer sessions introduces many more degrees of freedom,
requiring more data to draw proper conclusions.
For the first year, there will be three distinct types of
reformulations:
Full topic descriptions, including reformulation types, will be available after the run submission deadline.
1:low carb high fat diet:types of diets
2:us map:us map states and capitals
3:music man performances:music man script
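A query file in the colon-delimited format (query number, original query, reformulation) could be read with a few lines of Python. This is only a sketch: the names QueryPair, parse_query_line, and parse_queries are hypothetical, and splitting on the first two colons assumes that the original query itself never contains a colon, which the guidelines do not state.

```python
from typing import List, NamedTuple


class QueryPair(NamedTuple):
    number: int
    original: str
    reformulation: str


def parse_query_line(line: str) -> QueryPair:
    # Split on the first two colons only, so any colon inside the
    # reformulation text is preserved (assumption: the original
    # query contains no colon).
    number, original, reformulation = line.rstrip("\n").split(":", 2)
    return QueryPair(int(number), original.strip(), reformulation.strip())


def parse_queries(lines) -> List[QueryPair]:
    # Skip blank lines; every other line must have the three fields.
    return [parse_query_line(line) for line in lines if line.strip()]


sample = [
    "1:low carb high fat diet:types of diets",
    "2:us map:us map states and capitals",
    "3:music man performances:music man script",
]
pairs = parse_queries(sample)
for pair in pairs:
    print(pair.number, "|", pair.original, "->", pair.reformulation)
```

In practice the list of lines would come from the released file, for example by passing an open file object to parse_queries.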
Assessors will judge documents with respect to an information need provided by NIST. (Note: information needs will not be provided to participants.) In some cases, both queries will be assumed to represent the same information need, and assessors will judge all documents against that need, regardless of experimental condition. In other cases, the second query will represent a different (but related) information need, and judgments will be made differently depending on the condition.
Documents in the depth-10 pool will be judged for relevance so that
evaluation metrics such as PC(10) and nDCG(10) can be computed
exactly. If further resources are available, the remaining
judgments will be made on randomly selected documents from further down
the ranked lists (Terabyte-style). The methodology described in
Yilmaz et al., SIGIR 2008 ("A simple and efficient sampling
method for estimating AP and NDCG"), will then be employed to estimate
metrics such as average precision and nDCG. Further, the method
described in
Carterette et al., SIGIR 2007 ("Robust test collections for
retrieval evaluation"), may also be used to rank the
participating runs based on estimates of different metrics. The
aforementioned evaluation metrics will be used for G1, i.e., to
evaluate the effectiveness of retrieval systems on the ranked
lists RL2 and RL3 and to compare their performance.
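To make the depth-10 pool concrete, the sketch below builds the union of the top 10 documents from every submitted ranked list, per topic. It is an illustration of generic depth-k pooling, not the track's actual pooling code; the function name depth_k_pool, the in-memory run representation (a dict from topic number to a rank-ordered list of document IDs), and the short document IDs are all assumptions made for the example.

```python
from collections import defaultdict


def depth_k_pool(runs, k=10):
    """Return {topic: set of doc IDs} pooled from the top k of each run.

    `runs` is an iterable of dicts mapping a topic number to a list of
    document IDs ordered by rank (hypothetical in-memory format).
    """
    pool = defaultdict(set)
    for run in runs:
        for topic, ranking in run.items():
            pool[topic].update(ranking[:k])
    return dict(pool)


# Two toy runs for topic 1, pooled to depth 2 for brevity.
run_a = {1: ["docA", "docB", "docC"]}
run_b = {1: ["docB", "docD", "docA"]}
pool = depth_k_pool([run_a, run_b], k=2)
print(sorted(pool[1]))
```

Documents outside the pool (docC here) would be left unjudged unless they are picked up by the additional sampling described above.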
We will also evaluate the pairs of runs RL1 to RL2 and RL1 to RL3 in order to evaluate the ability of the original query/results to provide information for reformulations. The overall session evaluation for these will be done by
session-specific measures such as session nDCG
(Jarvelin et al. ECIR 2008, "Discounted Cumulated Gain Based
Evaluation of Multiple-Query IR Sessions") as well as other measures we are currently developing. In our new measures, a document ID that appears in both RL1 and RL2/RL3 will be penalized in RL2/RL3, with the penalty decreasing with the depth at which it appeared in RL1. For example, if document A appears at rank 1 in RL1, it will be heavily penalized for reappearing in RL2 or RL3. If document B appears at rank 100 in RL1, it will be penalized only slightly for reappearing in RL2 or RL3. The exact form of the penalty has yet to be determined.
The contents of the columns are:
A submission consisting of three files (mysys1.RL1, mysys1.RL2, mysys1.RL3) might look like this:
Your submission should be sorted by rank within topic number. By
implication it will also be sorted in descending order of score.
Evaluation measures
The primary measure by which systems will be compared is nDCG@10 on the second query in the session. Thus you can do well on the track simply by having a good ad hoc system, but you can potentially do better if you are able to make use of the first query and results.
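For readers unfamiliar with the measure, the following is a minimal sketch of nDCG@10 for a single query. The guidelines do not specify the gain or discount function, so the choices here (the relevance grade used directly as gain, a log2(rank + 1) discount, unjudged documents treated as non-relevant) are assumptions, as are the function names and the toy document IDs and grades.

```python
import math


def dcg_at_k(gains, k=10):
    # Discounted cumulated gain over the first k positions,
    # assuming a log2(rank + 1) discount with ranks starting at 1.
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_docids, judgments, k=10):
    # Unjudged documents are assumed to have gain 0.
    gains = [judgments.get(docid, 0) for docid in ranked_docids]
    ideal_gains = sorted(judgments.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


# Toy judgments for the second query, and two hypothetical result lists.
judgments = {"d1": 2, "d2": 1, "d3": 0}
rl2 = ["d3", "d2", "d1"]  # reformulation only
rl3 = ["d1", "d2", "d3"]  # reformulation plus first query
print(ndcg_at_k(rl2, judgments), ndcg_at_k(rl3, judgments))
```

Comparing the two scores for the same second query is the kind of RL2 versus RL3 comparison that G1 calls for.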
Important dates
How to participate
To participate in the Session track, you
must apply to participate
in TREC with NIST. (The official deadline is February
18th.) You must apply if you wish to participate, and
you must participate (in some track) if you wish to attend the TREC
meeting in November.
Run submissions
Official run protocol
Submission format
Each submission includes three separate ranked result lists for all 150 topics. Files should be named "runTag.RLn", where "runTag" is a unique identifier for your group and the particular submission, and "RLn" is RL1, RL2, or RL3, depending on the experimental condition. All three files must be present for a valid submission.
Each file should be in standard NIST format: a single ASCII text
file with white space separating columns. The width of
the columns is not important but you must have exactly six columns per
line with at least one space between the columns.
A script for verifying that submissions are valid will be provided.
$ cat mysys1.RL1
1 Q0 clueweb09-en0010-21-23199 1 9876 mysys1
1 Q0 clueweb09-en0481-51-62342 2 9875 mysys1
1 Q0 clueweb09-enwp04-22-09182 3 9874 mysys1
...
$ cat mysys1.RL2
1 Q0 clueweb09-en0010-21-23200 1 9963 mysys1
1 Q0 clueweb09-en0481-51-84332 2 9960 mysys1
1 Q0 clueweb09-enwp04-22-09992 3 9954 mysys1
...
$ cat mysys1.RL3
1 Q0 clueweb09-en0010-21-23199 1 9877 mysys1
1 Q0 clueweb09-en0481-51-62342 2 9875 mysys1
1 Q0 clueweb09-enwp04-22-09992 3 9870 mysys1
...
If you would normally return no documents for a query, instead return
the single document "clueweb09-en0000-00-00000" at rank
one. Doing so maintains consistent evaluation results (averages
over the same number of queries) and does not break anyone's tools.
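Until the official verification script is available, a rough local check of the format rules above can be sketched as follows. This is not the script the track will provide: the function name validate_run is hypothetical, and the checks assume that topic numbers and ranks are integers, that scores are numeric, and that blank lines may be ignored.

```python
def validate_run(lines):
    """Return a list of error messages for one ranked-list file."""
    errors = []
    prev_topic = prev_rank = prev_score = None
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: blank lines are tolerated
        cols = line.split()
        if len(cols) != 6:
            errors.append(f"line {n}: expected 6 columns, found {len(cols)}")
            continue
        topic, _q0, _docid, rank, score, _tag = cols
        try:
            topic, rank, score = int(topic), int(rank), float(score)
        except ValueError:
            errors.append(f"line {n}: non-numeric topic, rank, or score")
            continue
        if topic == prev_topic:
            # Within a topic, ranks must increase and scores must not.
            if rank <= prev_rank:
                errors.append(f"line {n}: rank {rank} follows rank {prev_rank}")
            if score > prev_score:
                errors.append(f"line {n}: score increases within topic {topic}")
        elif prev_topic is not None and topic < prev_topic:
            errors.append(f"line {n}: topic {topic} after topic {prev_topic}")
        prev_topic, prev_rank, prev_score = topic, rank, score
    return errors


good = [
    "1 Q0 clueweb09-en0010-21-23199 1 9876 mysys1",
    "1 Q0 clueweb09-en0481-51-62342 2 9875 mysys1",
]
bad = [
    "1 Q0 clueweb09-en0010-21-23199 1 mysys1",
    "1 Q0 clueweb09-en0481-51-62342 2 9875 mysys1",
    "1 Q0 clueweb09-enwp04-22-09182 1 9874 mysys1",
]
print(validate_run(good))
print(validate_run(bad))
```

The function would be run once per file (RL1, RL2, and RL3), for example on the lines of an opened mysys1.RL1.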
Various rules