Track Guidelines

See the track's web site for up-to-date information.

Table of Contents

  1. Tasks
  2. Track data
    1. Corpus
    2. Sessions
  3. Evaluation
    1. Relevance judgments and judging
    2. Evaluation measures
  4. Important dates
  5. How to participate
  6. Run submissions

Research in Information Retrieval has traditionally focused on serving the best results for a query, for varying definitions of "best": for ad hoc tasks, the most relevant results; for novelty and diversity tasks, the results that do the best job of covering a space of information needs; for known-item tasks, the result the user is looking for; for filtering tasks, the latest and most relevant results. But users often begin an interaction with a search engine with a sufficiently under-specified query that they will need to reformulate their query and/or information need several times before they find either the thing or every thing they are looking for. A search engine may be able to better serve a user not by ranking the most relevant results to each query in the sequence, but by ranking results that help "point the way" to what the user is really looking for, or by complementing results from previous queries in the sequence with new results, or in other currently-unanticipated ways.

We call a sequence of reformulations (along with any user interaction with the retrieval system) in service of satisfying an information need a session, and the goal of this track is: (G1) to test whether systems can improve their performance for a given query by using previous queries and user interactions with the retrieval system (including clicks on ranked results, dwell times, etc.), and (G2) to evaluate system performance over an entire query session instead of a single query.


This year we will only evaluate (G1), so there will be only a single task. Participants will be provided with a set of query sessions. Each session will consist of the current query plus, depending on the experimental condition, the session history: the queries previously issued in the session, the ranked lists of results returned for those queries, and the results the user clicked.

Participants will then run their retrieval system over the current query under each of four conditions, producing four ranked lists: (RL1) using the current query only, ignoring the rest of the session; (RL2) using the current query and the past queries in the session; (RL3) using, in addition, the ranked lists returned for those past queries; and (RL4) using, further, the results the user clicked.

Comparing the retrieval effectiveness of (RL1) with that of (RL2)--(RL4), we can evaluate whether a retrieval system can use information from earlier in the session to improve the results for the current query.
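One simple, purely illustrative way a system might use the session history under the four conditions is to weight terms from past queries and clicked results lower than terms from the current query. The function and weighting scheme below are assumptions for illustration, not part of the track protocol:

```python
# Illustrative sketch (not official track code): building a weighted bag of
# words a system might retrieve with under each condition. The helper name
# and the history_weight value are assumptions, not track requirements.

def build_query(current_query, past_queries=None, clicked_titles=None,
                history_weight=0.5):
    """Weighted term bag for RL1-RL4.

    RL1: pass the current query only.
    RL2/RL3: also pass past_queries.
    RL4: also pass clicked_titles (e.g. titles of clicked results).
    """
    weights = {}
    for term in current_query.lower().split():
        weights[term] = weights.get(term, 0.0) + 1.0
    for q in (past_queries or []):
        for term in q.lower().split():
            weights[term] = weights.get(term, 0.0) + history_weight
    for title in (clicked_titles or []):
        for term in title.lower().split():
            weights[term] = weights.get(term, 0.0) + history_weight / 2
    return weights

# RL1 uses only the current query:
rl1 = build_query("uses for cosmetic laser treatment")
# RL2 adds the previous query in the session:
rl2 = build_query("uses for cosmetic laser treatment",
                  past_queries=["wikipedia cosmetic laser treatment"])
```

Comparing the lists retrieved from `rl1` and `rl2` for the same current query is exactly the kind of contrast the (RL1) vs. (RL2)--(RL4) evaluation is designed to measure.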

Note that this is not an interactive track. Query sessions will be provided by NIST.

Track data


The track will use the ClueWeb09 collection. The full collection consists of roughly 1 billion web pages, comprising approximately 25TB of uncompressed data (5TB compressed) in 10 languages. The dataset was crawled from the Web during January and February 2009. Participants are encouraged to use the entire collection; however, submissions over the smaller "Category B" collection of 50 million documents will be accepted. Note that Category B submissions will be evaluated as if they were Category A submissions.

The collection is available from Carnegie Mellon University, distributed on four hard disks that will be shipped to you (you get to keep the disks). The entire collection can be obtained for US$610, while the Category B set can be obtained for US$220 (to cover the costs of the drives and preparing the data), plus shipping. Full details on how to acquire the ClueWeb09 collection are available here.

Data derived from the ClueWeb09 corpus such as Duplicate URLs, PageRank scores, Web Graph, Redirects, Anchor Text and Spam Rankings can be found here.


In 2010 we limited the focus of the track to sessions of two queries, and further limited the focus to particular types of sessions (generalization / specification / drifting). The query sessions were developed by the track coordinators using topics from the TREC 2009 and 2010 Web track (diversity task). A complete description of the process is given in a 2-page paper that was presented at the SIGIR 2010 Workshop on Simulation of Interaction in Geneva on July 23rd, 2010.

A drawback of the sessions released in 2010 was that the second query in the session was independent of the results returned by the first query. This year sessions have been collected from actual user activity. The TREC 2009 Million Query track and the TREC 2007 Question Answering track query sets were used to construct a set of topics. For example:

MQ topic : 20419

QA topic : 216

A custom-built search interface was developed that provided instructions to users on the tasks to be conducted and was used to gather data about a user's session. This included logging the user's interactions with the retrieval system, such as the queries issued, query reformulations and items clicked. Users were further asked to judge the web pages they viewed with respect to their relevance to their information need. All logged information was anonymous and no information about the users was collected. Each user was prompted to select one of ten topics and use a provided search engine to search the web for the selected topic. The search engine used was Yahoo! BOSS, and the web search results were filtered against the ClueWeb09 collection before they were shown to the user. Thus each session this year consists of a topic, a set of queries actual users posed to Yahoo! BOSS about the topic, the returned results, and the user interactions with those results. The session set will come in an XML file format. For example, in the case of RL4:

<session num="1" starttime="12:39:02.055014">
   <interaction num="1" starttime="12:39:10.280644">
      <query>wikipedia cosmetic laser treatment</query>
      <result rank="1">
         <title>Varicose Veins - Vein Treatment, Removal, Surgery Information</title>
         <snippet>... concern but can lead to more severe problems such as leg pain, leg
                   swelling and leg cramps. View photos and find a varicose vein treatment
                   center. ...</snippet>
      </result>
      <result rank="2">
         <title>Laser and IPL hair removal - Treatments - Peach Cosmetic ...</title>
         <snippet>Laser hair removal served as Dr Mahony's introduction to cosmetic medicine
                   back in 1999. ... Both our IPL and our laser offer skin chilling as part of
                   the treatment. ...</snippet>
      </result>
      ...
      <result rank="10">
         <title>Cosmetic Surgery, Cosmetic Doctors, Cosmetic Physicians, and ...</title>
         <snippet>Cosmetic Surgery 10 is a resource that provides key information on cosmetic
                   surgeries focusing on plastic surgeries, dermatology, cosmetic dentists and
                   LASIK procedures.</snippet>
      </result>
      <click num="1" starttime="12:40:15.603468" endtime="12:42:10.565420"/>
      <click num="2" starttime="12:42:18.244467" endtime="12:43:41.841436"/>
   </interaction>
   <interaction num="2">
      ...
   </interaction>
   <currentquery starttime="12:44:12.659006">
      <query>uses for cosmetic laser treatment</query>
   </currentquery>
</session>
Note that most of the released sessions will consist of a single previous query (along with the user interactions) and the current query. However, there may be sessions with longer history (e.g. two or three previous queries).
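Assuming the released session files are well-formed XML using the element names shown above, a session's history and current query can be read with Python's standard library. The helper name below is an assumption for illustration:

```python
# Sketch: reading one session with Python's ElementTree, assuming
# well-formed XML with <interaction>, <query> and <currentquery> elements
# as in the example above.
import xml.etree.ElementTree as ET

SAMPLE = """<session num="1" starttime="12:39:02.055014">
  <interaction num="1" starttime="12:39:10.280644">
    <query>wikipedia cosmetic laser treatment</query>
    <result rank="1"><title>Varicose Veins</title></result>
    <click num="1" starttime="12:40:15.603468" endtime="12:42:10.565420"/>
  </interaction>
  <currentquery starttime="12:44:12.659006">
    <query>uses for cosmetic laser treatment</query>
  </currentquery>
</session>"""

def parse_session(xml_text):
    """Return (past queries in order, current query) for one session."""
    root = ET.fromstring(xml_text)
    past_queries = [i.findtext("query") for i in root.findall("interaction")]
    current = root.find("currentquery").findtext("query")
    return past_queries, current

past, current = parse_session(SAMPLE)
```

The same traversal extends naturally to the `<result>` and `<click>` children when building RL3 and RL4 runs.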


Relevance judgments and judging

Judging will be done by assessors at NIST.

Assessors will judge documents with respect to an information need provided by NIST. (Note: information needs will not be provided to participants.) In some cases, both queries will be assumed to represent the same information need, and assessors will judge all documents against that need, regardless of experimental condition. In other cases, the second query will represent a different (but related) information need, and judgments will be made differently depending on the condition.

Documents in the depth-10 pool will be judged for relevance so that evaluation metrics such as PC(10) and nDCG(10) can be computed exactly. If further resources are available, the remaining judgments will be made on randomly selected documents further down the ranked lists (in the style of the Terabyte track). The methodology described in Yilmaz et al., SIGIR 2008 ("A simple and efficient sampling method for estimating AP and NDCG") will then be employed to estimate metrics such as average precision, nDCG, etc. Further, the method described in Carterette et al., SIGIR 2007 ("Robust test collections for retrieval evaluation") may also be used to produce a ranking of participating runs based on estimates of different metrics. These evaluation metrics will be used for (G1), i.e. to evaluate the effectiveness of retrieval systems over the ranked lists RL1 -- RL4 and compare their performance.

Evaluation measures

The primary measure by which systems will be compared is nDCG@10 on the current query in the session. Thus you can do well on the track simply by having a good ad hoc system, but you can potentially do better if you are able to make use of the session prior to the current query.
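For reference, nDCG@10 divides the discounted cumulative gain of the submitted top 10 by that of an ideal ordering of the judged documents. The sketch below uses an exponential gain and log2 discount; the exact gain and discount used by the track's official evaluation tools may differ:

```python
# Illustrative nDCG@k, assuming gain 2^rel - 1 and a log2(rank + 1)
# discount; NIST's official scoring may use different choices.
import math

def ndcg_at_k(ranked_rels, all_rels, k=10):
    """ranked_rels: relevance grades of the submitted ranking, in rank order.
    all_rels: relevance grades of all judged documents for the query."""
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(i + 2)
                   for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(all_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0

perfect = ndcg_at_k([2, 1, 1], [2, 1, 1])   # ideal ordering
swapped = ndcg_at_k([0, 2], [2, 0])          # relevant document demoted
```

A ranking already in ideal order scores 1.0; demoting a relevant document below a non-relevant one lowers the score.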

Web pages ranked for the current query that appeared in the ranked lists of past queries and were clicked by the user will be considered duplicates. Duplicate results will not be directly penalised, but they will be removed from the ranked lists RL1 -- RL4 during evaluation.
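The duplicate-removal rule amounts to a simple set filter over each submitted ranking. The function and variable names here are illustrative assumptions, not part of the official evaluation tooling:

```python
# Sketch of the duplicate-removal rule above: documents that were both
# returned for a past query and clicked by the user are dropped from the
# evaluated ranking before scoring.

def remove_session_duplicates(ranking, past_clicked_docs):
    seen = set(past_clicked_docs)
    return [doc for doc in ranking if doc not in seen]

filtered = remove_session_duplicates(["d1", "d2", "d3"],
                                     past_clicked_docs={"d2"})
```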

Important dates

May 27 Submit your application to participate in TREC 2011
Early June Guidelines finalized; Sessions released.
August 3 Results submission deadline.
September Relevance judgments and individual evaluation scores due back to participants (estimated).
November 15-18 TREC 2011 conference at NIST in Gaithersburg, MD, USA (you must participate in some track to attend).

How to participate

To participate in the Session track, you must
apply to participate in TREC with NIST. (The official deadline was in February but applications will be accepted until May 27th.) You must apply if you wish to participate, and you must participate (in some track) if you wish to attend the TREC meeting in November.

Run submissions

Official run protocol

  1. Download the topics from the TREC Active Participants page (password required). Note that there are four separate topic files, one for each of the four conditions described above.
  2. Run the last query in each topic once for each of the four conditions; put the results in the format described below.
  3. Upload the ranked lists to NIST (a link will be provided; you will need the TREC password).  
You may provide up to three submissions for judging. Each submission should include four runs (RL1, RL2, RL3, RL4) and can have up to 2,000 documents ranked for each query (you may have fewer, but not more).  If you are returning zero documents for a query, instead return the single document "clueweb09-en0000-00-00000".

Submission format

Each submission includes four separate ranked result lists for all topics. Files should be named "runTag.RLn", where "runTag" is a unique identifier for your group and the particular submission, and "RLn" is RL1, RL2, RL3, or RL4, depending on the experimental condition. All four files must be present for a valid submission. Each file should be in standard NIST format: a single ASCII text file with white space separating columns.  The width of the columns is not important but you must have exactly six columns per line with at least one space between the columns.

The contents of the columns are:

  1. The first column is the topic number.
  2. The second column should always contain the string "Q0" (letter "Q" followed by number "0").
  3. The third column is the official document number of the retrieved document, found in the <DOCNO> field of the document.
  4. The fourth column is the rank of that document for that query.  Within a query, each of the numbers from 1 to 2000 should appear exactly once.  (If you retrieve fewer than 2000 documents, then you will have the numbers from 1 to that number.)
  5. The fifth column is the score your system generated to rank this document, either as an integer or a floating point number.  Scores must be in descending order.  Note that typical TREC evaluations use this column, not the rank column, to evaluate systems.  If you want the precise ranking you submit to be evaluated, the scores must reflect that ranking.
  6. The sixth column is your "run tag" and is a unique identifier for your group and the particular run.  Please change the tag from year to year, track to track, and run to run, so that different approaches can be compared.  Run tags may contain 12 or fewer letters and numbers with no punctuation (and no white space, or the line would have more than six columns). Each of the four files comprising a submission should have the same run tag.
A script for verifying that submissions are valid will be provided.
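Until that script is released, the format rules above can be checked locally. The sketch below is not the official verifier; its checks are inferred from the column descriptions in this section:

```python
# Rough sketch of the checks a submission verifier might perform, inferred
# from the format rules above (six columns, "Q0", ranks 1..2000 in order,
# non-increasing scores, run tag of at most 12 letters and digits).
import re

def check_run_file(lines):
    errors = []
    prev = {}  # topic -> (last rank, last score)
    for n, line in enumerate(lines, 1):
        cols = line.split()
        if len(cols) != 6:
            errors.append(f"line {n}: expected 6 columns, got {len(cols)}")
            continue
        topic, q0, docno, rank, score, tag = cols
        if q0 != "Q0":
            errors.append(f"line {n}: second column must be 'Q0'")
        if not re.fullmatch(r"[A-Za-z0-9]{1,12}", tag):
            errors.append(f"line {n}: run tag must be <=12 letters/digits")
        rank, score = int(rank), float(score)
        if rank > 2000:
            errors.append(f"line {n}: at most 2000 documents per topic")
        last = prev.get(topic)
        if last:
            if rank != last[0] + 1:
                errors.append(f"line {n}: ranks must increase by 1 within a topic")
            if score > last[1]:
                errors.append(f"line {n}: scores must not increase")
        elif rank != 1:
            errors.append(f"line {n}: first rank for a topic must be 1")
        prev[topic] = (rank, score)
    return errors

ok = check_run_file([
    "1 Q0 clueweb09-en0010-21-23199 1 9876 mysys1",
    "1 Q0 clueweb09-en0481-51-62342 2 9875 mysys1",
])
bad = check_run_file([
    "1 Q0 clueweb09-en0010-21-23199 1 9876 mysys1",
    "1 Q0 clueweb09-en0481-51-62342 3 9875 mysys1",  # rank gap
])
```

Running such a check on all four RLn files before uploading catches most of the rejections described above.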

A submission consisting of four files (mysys1.RL1, mysys1.RL2, mysys1.RL3, mysys1.RL4) might look like this:

 $ cat mysys1.RL1 
 1 Q0 clueweb09-en0010-21-23199 1 9876 mysys1 
 1 Q0 clueweb09-en0481-51-62342 2 9875 mysys1 
 1 Q0 clueweb09-enwp04-22-09182 3 9874 mysys1 
 ...
 $ cat mysys1.RL2 
 1 Q0 clueweb09-en0010-21-23200 1 9963 mysys1 
 1 Q0 clueweb09-en0481-51-84332 2 9960 mysys1 
 1 Q0 clueweb09-enwp04-22-09992 3 9954 mysys1 
 ...
 $ cat mysys1.RL3 
 1 Q0 clueweb09-en0010-21-23199 1 9877 mysys1 
 1 Q0 clueweb09-en0481-51-62342 2 9875 mysys1 
 1 Q0 clueweb09-enwp04-22-09992 3 9870 mysys1 
 ...
 $ cat mysys1.RL4 
 1 Q0 clueweb09-en0010-21-23200 1 9999 mysys1 
 1 Q0 clueweb09-en0010-21-23199 2 9998 mysys1 
 ...

Your submission should be sorted by rank within topic number.  By implication it will also be sorted in descending order of score.

If you would normally return no documents for a query, instead return the single document "clueweb09-en0000-00-00000" at rank one.  Doing so maintains consistent evaluation results (averages over the same number of queries) and does not break anyone's tools.

Various rules

  1. A manual run is one in which a person is somehow involved in the process of converting a query into a ranked list, whether by formulating the query by hand, modifying the query by hand, classifying the query by hand, or adjusting the ranked list by hand.  
  2. If you used the provided topic descriptions in any way, say so when you submit your runs.