Track Guidelines

See the track's web site at http://ir.cis.udel.edu/sessions for up-to-date information.

Table of Contents

  1. Task
  2. Track data
    1. Corpus
    2. Queries
  3. Evaluation
    1. Relevance judgments and judging
    2. Evaluation measures
  4. Important dates
  5. How to participate
  6. Run submissions

Research in Information Retrieval has traditionally focused on serving the best results for a query, for varying definitions of "best": for ad hoc tasks, the most relevant results; for novelty and diversity tasks, the results that do the best job of covering a space of information needs; for known-item tasks, the result the user is looking for; for filtering tasks, the latest and most relevant results. But users often begin an interaction with a search engine with a sufficiently ill-specified query that they will need to reformulate their query and/or information need several times before they find either the thing or every thing they are looking for. A search engine may be able to better serve a user not by ranking the most relevant results to each query in the sequence, but by ranking results that help "point the way" to what the user is really looking for, or by complementing results from previous queries in the sequence with new results, or in other currently-unanticipated ways.

We call a sequence of reformulations in service of satisfying an information need a session, and the goal of this track is: (G1) to test whether systems can improve their performance for a given query by using a previous query, and (G2) to evaluate system performance over an entire query session instead of a single query.

Task

For this first year, we will limit the focus of this track to sessions of two queries, and further limit the focus to particular types of sessions (defined below). A set of 150 query pairs (original query, query reformulation) will be provided to participants by NIST. For each such pair, participants will submit three ranked lists of documents, one for each of three experimental conditions:
  1. one over the original query (RL1)
  2. one over the query reformulation, ignoring the original query (RL2)
  3. one over the query reformulation taking into consideration the original query (RL3)
Using the ranked lists (RL2) and (RL3), we will evaluate the ability of systems to utilize prior history (G1). Using the ranked lists (RL1) and (RL3), we will evaluate the quality of the ranking function over the entire session (G2). Note that this will not be an interactive track; query reformulations will be provided by NIST. Further note that when retrieving results for (RL3), the only extra information about the user's intent will be the original query. This will be a one-phase track, with no feedback provided by the assessors.

Track data

Corpus

The track will use the new ClueWeb09 collection. The full collection consists of roughly 1 billion web pages, comprising approximately 25TB of uncompressed data (5TB compressed) in multiple languages. The dataset was crawled from the Web during January and February 2009. Participants are encouraged to use the entire collection; however, submissions over the smaller "Category B" collection of 50 million documents will be accepted. Note that Category B submissions will be evaluated as if they were Category A submissions.

The collection is available from Carnegie Mellon University, distributed on four hard disks that will be shipped to you (you get to keep the disks). The entire collection can be obtained for US$790, while the Category B set can be obtained for US$240 (to cover the costs of the drives and preparing the data), plus shipping. Full details on how to acquire the ClueWeb09 collection are available here.

Queries

As noted above, the first year of the track is limited to sessions of two queries and to particular types of sessions. This is partly for pragmatic reasons regarding the difficulty of obtaining session data, and partly for reasons of experimental design and analysis: allowing longer sessions introduces many more degrees of freedom, requiring more data to draw proper conclusions. There will be three distinct types of reformulations:
  1. generalizations: the user entered a query, realized that the results were too narrow or that she wanted a wider range of information, and reformulated a more general query.

    ex: "low carb high fat diet" -> "types of diets"

  2. specifications: the user entered a query, realized the results were too broad or that he wanted more detailed information, and reformulated a more specific query.

    ex: "us map" -> "us map states and capitals"

  3. drifting: the user entered a query, then reformulated to another query at the same level of specificity but addressing a different aspect or facet of the information need.

    ex: "music man performances" -> "music man script"

A set of 150 query pairs (original query, query reformulation) will be provided to participants by NIST. We have re-used the 2009 Web track queries; these topics have a "main theme" and a series of "aspects". We used the main theme and aspects of these topics in a variety of combinations to simulate an initial query and a second query. A more complete description of the process is given in a 2-page paper that we will present at the SIGIR 2010 Workshop on Simulation of Interaction in Geneva on July 23rd.

Queries will be released in a colon-delimited text file; each line will consist of a query number, a keyword query, and a reformulation, without information about the reformulation type. For example:

1:low carb high fat diet:types of diets
2:us map:us map states and capitals
3:music man performances:music man script
Full topic descriptions, including reformulation types, will be available after the run submission deadline.
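
For convenience, here is a minimal sketch (in Python) of reading the colon-delimited file described above. The file name in the example is a placeholder, and the code assumes the original query itself contains no colons.

def read_query_pairs(path):
    """Return a list of (topic number, original query, reformulation) tuples."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Each line has the form "number:original query:query reformulation".
            number, original, reformulation = line.split(":", 2)
            pairs.append((int(number), original, reformulation))
    return pairs

# Example usage (the file name "session.topics.txt" is hypothetical):
# for num, q1, q2 in read_query_pairs("session.topics.txt"):
#     print(num, q1, "->", q2)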

Evaluation

Relevance judgments and judging

Judging will be done by assessors at NIST.

Assessors will judge documents with respect to an information need provided by NIST. (Note: information needs will not be provided to participants.) In some cases, both queries will be assumed to represent the same information need, and assessors will judge all documents against that need, regardless of experimental condition. In other cases, the second query will represent a different (but related) information need, and judgments will be made differently depending on the condition.

Documents in the depth-10 pool will be judged for relevance so that evaluation metrics such as PC(10) and nDCG(10) can be computed exactly. If further resources are available, the remaining judgments will be made on randomly selected documents further down the ranked lists (in the style of the Terabyte track). The methodology described in Yilmaz et al., SIGIR 2008 ("A simple and efficient sampling method for estimating AP and NDCG") will then be employed to estimate metrics such as average precision, nDCG, etc. Further, the method described in Carterette et al., SIGIR 2007 ("Robust test collections for retrieval evaluation") may also be used to produce a ranking of participating runs based on estimates of different metrics. The aforementioned evaluation metrics will be used for G1, i.e. to evaluate the effectiveness of retrieval systems over the ranked lists RL2 and RL3 and to compare their performance.

Evaluation measures

The primary measure by which systems will be compared is nDCG@10 on the second query in the session. Thus you can do well on the track simply by having a good ad hoc system, but you can potentially do better if you are able to make use of the first query and results.
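
For reference, here is a minimal sketch (in Python) of one common formulation of nDCG@10, with graded gains and a log2 rank discount; the official evaluation may use different gain or discount functions, and the function names are only illustrative.

import math

def dcg(gains, k=10):
    """Discounted cumulative gain at depth k: sum of gain / log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

def ndcg_at_10(ranked_docids, qrels):
    """nDCG@10 of a ranking; qrels maps docid -> graded relevance (0 if unjudged)."""
    gains = [qrels.get(d, 0) for d in ranked_docids]
    ideal_dcg = dcg(sorted(qrels.values(), reverse=True))
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0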

We will also evaluate the pairs of runs (RL1, RL2) and (RL1, RL3) in order to assess the ability of the original query/results to provide information for reformulations. The overall session evaluation for these will be done by session-specific measures such as session nDCG (Jarvelin et al., ECIR 2008, "Discounted Cumulated Gain Based Evaluation of Multiple-Query IR Sessions") as well as other measures we are currently developing. In our new measures, a document ID that appears in both RL1 and RL2/RL3 will be penalized in RL2/RL3, with the penalization decreasing with the depth at which it appeared in RL1. For example, if document A appears at rank 1 in RL1, it will be heavily penalized for reappearing in RL2 or RL3. If document B appears at rank 100 in RL1, it will not be penalized much for reappearing in RL2 or RL3. The exact form of the penalization has yet to be determined.
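
Because the exact penalization is still open, the following Python sketch is purely illustrative and is not the official measure. It scales the gain of a document repeated from RL1 by a factor that grows with its depth in RL1, so a document repeated from rank 1 contributes nothing while one repeated from rank 100 is barely penalized.

import math

def penalized_gains(rl_docids, rl1_docids, qrels):
    """Gains for an RL2/RL3 ranking with a hypothetical duplicate penalty.

    qrels maps docid -> graded relevance; a document that also appears in RL1
    keeps only a fraction 1 - 1/log2(rl1_rank + 1) of its gain."""
    rl1_rank = {d: r for r, d in enumerate(rl1_docids, start=1)}
    gains = []
    for d in rl_docids:
        g = qrels.get(d, 0)
        if d in rl1_rank:
            g *= 1.0 - 1.0 / math.log2(rl1_rank[d] + 1)  # rank 1 -> factor 0, rank 100 -> ~0.85
        gains.append(g)
    return gains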

Important dates

How to participate

To participate in the Session track, you must apply to participate in TREC with NIST. (The official deadline is February 18th.) You must apply if you wish to participate, and you must participate (in some track) if you wish to attend the TREC meeting in November.

Run submissions

Official run protocol

  1. Download the 150 topics, each with two queries (a link will be provided).
  2. Run the first query in each pair once (RL1) and the second query twice (RL2, RL3); put results in the format described below.
  3. Upload the ranked lists to NIST (a link will be provided; you will need the TREC password).  
You may provide up to three submissions for judging. Each submission should include three runs (RL1, RL2, RL3) and should have up to 2,000 documents ranked for each query (you may have fewer, but not more).  If you are returning zero documents for a query, instead return the single document "clueweb09-en0000-00-00000".

Submission format

Each submission includes three separate ranked result lists for all 150 topics. Files should be named "runTag.RLn", where "runTag" is a unique identifier for your group and the particular submission, and "RLn" is RL1, RL2, or RL3, depending on the experimental condition. All three files must be present for a valid submission. Each file should be in standard NIST format: a single ASCII text file with white space separating columns.  The width of the columns is not important but you must have exactly six columns per line with at least one space between the columns.

The contents of the columns are:

  1. The first column is the topic number.
  2. The second column should always contain the string "Q0" (letter "Q" followed by number "0").
  3. The third column is the official document number of the retrieved document, found in the <DOCNO> field of the document.
  4. The fourth column is the rank of that document for that query.  Within a query, each of the numbers from 1 to 2000 should appear exactly once.  (If you retrieve fewer than 2000 documents, then you will have the numbers from 1 to that number.)
  5. The fifth column is the score your system generated to rank this document, either as an integer or a floating point number.  Scores must be in descending order.  Note that typical TREC evaluations use this column, not the rank column, to evaluate systems.  If you want the precise ranking you submit to be evaluated, the scores must reflect that ranking.
  6. The sixth column is your "run tag" and is a unique identifier for your group and the particular run.  Please change the tag from year to year, track to track, and run to run, so that different approaches can be compared.  Run tags may contain 12 or fewer letters and numbers with no punctuation (and no white space, or the line would have more than six columns). Each of the three files comprising a submission should have the same run tag.
A script for verifying that submissions are valid will be provided.
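
Until that script is released, the following Python sketch checks a single RLn file against the rules above; it is not the official validator and may be stricter or looser than NIST's checks.

import re, sys
from collections import defaultdict

def check_run_file(path):
    """Rough sanity check of one RLn file: six columns, "Q0", ranks 1..n in order,
    non-increasing scores, a single 1-12 character alphanumeric run tag, and at
    most 2000 documents per topic."""
    per_topic = defaultdict(list)
    tags = set()
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            cols = line.split()
            if len(cols) != 6:
                print(f"{path}:{lineno}: expected 6 columns, got {len(cols)}")
                continue
            topic, q0, docno, rank, score, tag = cols
            if q0 != "Q0":
                print(f"{path}:{lineno}: second column must be the string 'Q0'")
            if not re.fullmatch(r"[A-Za-z0-9]{1,12}", tag):
                print(f"{path}:{lineno}: run tag must be 1-12 letters and digits")
            tags.add(tag)
            per_topic[int(topic)].append((int(rank), float(score)))
    if len(tags) > 1:
        print(f"{path}: more than one run tag used: {sorted(tags)}")
    for topic, rows in sorted(per_topic.items()):
        ranks = [r for r, _ in rows]
        scores = [s for _, s in rows]
        if len(rows) > 2000:
            print(f"{path}: topic {topic} has more than 2000 documents")
        if ranks != list(range(1, len(ranks) + 1)):
            print(f"{path}: topic {topic} ranks are not 1..{len(ranks)} in order")
        if any(a < b for a, b in zip(scores, scores[1:])):
            print(f"{path}: topic {topic} scores are not in descending order")

if __name__ == "__main__":
    for p in sys.argv[1:]:
        check_run_file(p)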

A submission consisting of three files (mysys1.RL1, mysys1.RL2, mysys1.RL3) might look like this:

$ cat mysys1.RL1
1 Q0 clueweb09-en0010-21-23199 1 9876 mysys1
1 Q0 clueweb09-en0481-51-62342 2 9875 mysys1
1 Q0 clueweb09-enwp04-22-09182 3 9874 mysys1
...

$ cat mysys1.RL2
1 Q0 clueweb09-en0010-21-23200 1 9963 mysys1
1 Q0 clueweb09-en0481-51-84332 2 9960 mysys1
1 Q0 clueweb09-enwp04-22-09992 3 9954 mysys1
...

$ cat mysys1.RL3
1 Q0 clueweb09-en0010-21-23199 1 9877 mysys1
1 Q0 clueweb09-en0481-51-62342 2 9875 mysys1
1 Q0 clueweb09-enwp04-22-09992 3 9870 mysys1
...

Your submission should be sorted by rank within topic number.  By implication it will also be sorted in descending order of score.

If you would normally return no documents for a query, instead return the single document "clueweb09-en0000-00-00000" at rank one.  Doing so maintains consistent evaluation results (averages over the same number of queries) and does not break anyone's tools.

Various rules

  1. A manual run is one in which a person is somehow involved in the process of converting a query into a ranked list, whether by formulating the query by hand, modifying the query by hand, classifying the query by hand, or adjusting the ranked list by hand.  
  2. A system may not be modified in light of the set of queries sent.  You should not look at the evaluation queries before you are ready to run your system.
  3. For the ranked set RL2 described above, you should not use the original query in any manner to improve your results.