Track Guidelines

See the track's web site for up-to-date information.

Table of Contents

  1. Tasks
  2. Track data
    1. Corpus
    2. Sessions
  3. Evaluation
    1. Relevance judgments and judging
    2. Evaluation measures
  4. Important dates
  5. How to participate
  6. Run submissions

Research in Information Retrieval has traditionally focused on serving the best results for a query, for varying definitions of "best": for ad hoc tasks, the most relevant results; for novelty and diversity tasks, the results that do the best job of covering a space of information needs; for known-item tasks, the result the user is looking for; for filtering tasks, the latest and most relevant results. But users often begin an interaction with a search engine with a sufficiently under-specified query that they will need to reformulate their query and/or information need several times before they find either the thing or every thing they are looking for. A search engine may be able to better serve a user not by ranking the most relevant results to each query in the sequence, but by ranking results that help "point the way" to what the user is really looking for, or by complementing results from previous queries in the sequence with new results, or in other currently-unanticipated ways.

We call a sequence of reformulations (along with any user interaction with the retrieval system) in service of satisfying an information need a session, and the goal of this track is: (G1) to test whether systems can improve their performance for a given query by using previous queries and user interactions with the retrieval system (including clicks on ranked results, dwell times, etc.), and (G2) to evaluate system performance over an entire query session instead of a single query.


As we did last year, in 2012 we will evaluate only (G1); thus there will be a single task. Participants will be provided with a set of query sessions. Each session will consist of the current query together with the user's prior activity in that session: the earlier queries, the ranked lists of results returned for them, and the results the user clicked.

Participants will then run their retrieval system over the current query under four conditions, RL1 through RL4, each allowing progressively more of the session history to be used.

Comparing the retrieval effectiveness in (RL1) with the retrieval effectiveness in (RL2)--(RL4), we can evaluate whether a retrieval system can use information from earlier in the session to improve the results of the current query.

Note that this is not an interactive track. Query sessions will be provided by NIST.

Track data


Corpus

The track will use the ClueWeb09 collection. The full collection consists of roughly 1 billion web pages, comprising approximately 25TB of uncompressed data (5TB compressed) in 10 languages. The dataset was crawled from the Web during January and February 2009. Participants are encouraged to use the entire collection; however, submissions over the smaller "Category B" collection of 50 million documents will be accepted. Note that Category B submissions will be evaluated as if they were Category A submissions.

The collection is available from Carnegie Mellon University, distributed on four hard disks that will be shipped to you (you get to keep the disks). The entire collection can be obtained for US$580, while the Category B set can be obtained for US$190 (to cover the costs of the drives and preparing the data), plus shipping. Full details on how to acquire the ClueWeb09 collection are available here.

Data derived from the ClueWeb09 corpus such as duplicate URLs, PageRank scores, web graphs, redirects, anchor text and spam rankings can be found here.


Sessions

Rather than distributing topics as most tracks do, the Session track distributes sessions: sequences of user interactions recorded in the process of satisfying some information need. As in 2011, sessions have been collected from actual user search activity for topics derived from previous years' TREC tracks.

Session data is distributed in an XML file format. For example:

<session num="1" starttime="12:39:02.055014">
   <interaction num="1" starttime="12:39:10.280644">
      <query>wikipedia cosmetic laser treatment</query>
      <result rank="1">
         <title>Varicose Veins - Vein Treatment, Removal, Surgery Information</title>
         <snippet>... concern but can lead to more severe problems such as leg pain, leg
            swelling and leg cramps. View photos and find a varicose vein treatment
            center. ...</snippet>
      </result>
      <result rank="2">
         <title>Laser and IPL hair removal - Treatments - Peach Cosmetic ...</title>
         <snippet>Laser hair removal served as Dr Mahony's introduction to cosmetic medicine
            back in 1999. ... Both our IPL and our laser offer skin chilling as part of
            the treatment. ...</snippet>
      </result>
      ...
      <result rank="10">
         <title>Cosmetic Surgery, Cosmetic Doctors, Cosmetic Physicians, and ...</title>
         <snippet>Cosmetic Surgery 10 is a resource that provides key information on cosmetic
            surgeries focusing on plastic surgeries, dermatology, cosmetic dentists and
            LASIK procedures.</snippet>
      </result>
      <click num="1" starttime="12:40:15.603468" endtime="12:42:10.565420" />
      <click num="2" starttime="12:42:18.244467" endtime="12:43:41.841436" />
   </interaction>
   <interaction num="2">
      ...
   </interaction>
   <currentquery starttime="12:44:12.659006">
      <query>uses for cosmetic laser treatment</query>
   </currentquery>
</session>

When constructing your submissions, please follow these guidelines for using information in the XML file:

  1. For your RL1 submission(s), ignore all <interaction> blocks. Use only the <currentquery>.
  2. For your RL2 submission(s), you may use the <query> fields in <interaction> blocks in addition to the <currentquery>. These are previous queries in the session.
  3. For your RL3 submission(s), you may use any information in the <result> blocks in addition to the <query> and <currentquery>.
  4. For your RL4 submission(s), you may use the <clicked> blocks in addition to the <result>, <query>, and <currentquery>.
Please do not use any information in the <topic> field!
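To make the four conditions concrete, here is a minimal sketch of how the evidence allowed under each of RL1--RL4 could be pulled out of a session XML file. The element names follow the example above; the embedded XML is a small hand-made sample, not real track data, and the actual files may contain additional fields.

```python
# Sketch: extracting the information usable under each condition (RL1-RL4)
# from a session XML file.  SAMPLE is a hand-made fragment in the format
# shown above, not real track data.
import xml.etree.ElementTree as ET

SAMPLE = """
<session num="1" starttime="12:39:02.055014">
  <interaction num="1" starttime="12:39:10.280644">
    <query>wikipedia cosmetic laser treatment</query>
    <result rank="1">
      <title>Varicose Veins</title>
      <snippet>leg pain, leg swelling and leg cramps</snippet>
    </result>
    <click num="1" starttime="12:40:15.603468" endtime="12:42:10.565420" />
  </interaction>
  <currentquery starttime="12:44:12.659006">
    <query>uses for cosmetic laser treatment</query>
  </currentquery>
</session>
"""

def session_evidence(xml_text, condition):
    """Return the evidence a system may use under condition RL1..RL4."""
    root = ET.fromstring(xml_text)
    evidence = {"current_query": root.find("currentquery/query").text}
    if condition >= 2:  # RL2: previous queries in the session
        evidence["past_queries"] = [
            q.text for q in root.findall("interaction/query")]
    if condition >= 3:  # RL3: ranked results (titles and snippets)
        evidence["results"] = [
            (r.get("rank"), r.findtext("title"), r.findtext("snippet"))
            for r in root.findall("interaction/result")]
    if condition >= 4:  # RL4: clicks with start/end times (dwell)
        evidence["clicks"] = [
            (c.get("num"), c.get("starttime"), c.get("endtime"))
            for c in root.findall("interaction/click")]
    return evidence

rl1 = session_evidence(SAMPLE, 1)
rl4 = session_evidence(SAMPLE, 4)
print(rl1["current_query"])   # uses for cosmetic laser treatment
print(rl4["past_queries"])    # ['wikipedia cosmetic laser treatment']
```

Under RL1 only `current_query` is available; each higher condition adds one layer of session history.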


Relevance judgments and judging

Judging will be done by assessors at NIST.

Assessors will judge documents with respect to an information need provided by NIST. (Note: information needs will not be provided to participants.) In some cases, all queries in a session will be assumed to represent the same information need, and assessors will judge all documents against that need, regardless of experimental condition. In other cases, the current query will represent a different (but related) information need from the earlier queries, and judgments will be made differently depending on the condition.

Documents in the depth-10 pool will be judged for relevance so that evaluation metrics such as PC(10) and nDCG@10 can be computed exactly. If further resources are available, additional judgments will be made on randomly selected documents further down the ranked lists (as in the Terabyte track). These metrics will be used for (G1), i.e. to evaluate the effectiveness of retrieval systems over the ranked lists RL1 -- RL4 and compare their performance.

Evaluation measures

The primary measure by which systems will be compared is nDCG@10 on the current query in the session. Thus you can do well on the track simply by having a good ad hoc system, but you can potentially do better if you are able to make use of the session prior to the current query.

Web pages ranked for the current query that have appeared in the ranked lists of past queries and have been clicked by the user will be considered as duplicates. Duplicate results will not be directly penalised but they will be removed from the ranked lists RL1 -- RL4 during evaluation.
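The dedup-then-score procedure above can be sketched as follows. This is an illustration only: the gain and discount follow the common nDCG definition (gain 2^rel - 1, discount log2(rank + 1)), and the track's exact variant, as computed by the official evaluation tools, may differ. The documents and judgments below are hypothetical.

```python
# Sketch: remove previously clicked documents from the ranked list, then
# compute nDCG@10 on what remains.  Gain/discount follow a common nDCG
# definition; the track's official variant may differ.
import math

def ndcg_at_k(ranked_rels, ideal_rels, k=10):
    def dcg(rels):
        return sum((2 ** rel - 1) / math.log2(i + 2)
                   for i, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(ideal_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0

def evaluate(ranked_docs, qrels, clicked_earlier, k=10):
    """Drop duplicates (docs clicked for earlier queries), then score."""
    deduped = [d for d in ranked_docs if d not in clicked_earlier]
    rels = [qrels.get(d, 0) for d in deduped]
    return ndcg_at_k(rels, list(qrels.values()), k)

qrels = {"docA": 2, "docB": 1, "docC": 0, "docD": 1}  # hypothetical judgments
ranked = ["docX", "docA", "docB", "docC"]             # docX was clicked earlier
print(round(evaluate(ranked, qrels, clicked_earlier={"docX"}), 3))  # 0.879
```

Note that removing `docX` promotes the remaining documents one rank each, so a system is not directly penalised for retrieving it, as stated above.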

Important dates

May 27 Submit your application to participate in TREC 2012
Early June Guidelines finalized; sample sessions released.
Mid June Sessions released.
August 29 Results submission deadline.
September 30 Relevance judgments and individual evaluation scores due back to participants (estimated).
November 6-9 TREC 2012 conference at NIST in Gaithersburg, MD, USA (you must participate in some track to attend).

How to participate

To participate in the Session track, you must
apply to participate in TREC with NIST, and you must participate in some track if you wish to attend the TREC meeting in November.

Run submissions

Official run protocol

  1. Download the topics from the TREC Active Participants page (password required). As of today, the topics are not yet available. Note that there will be four separate topic files, one for each of the four conditions described above.
  2. Run the last query in each topic once for each of the four conditions; put the results in the format described below.
  3. Upload the ranked lists to NIST (a link will be provided; you will need the TREC password).  
You may provide up to three submissions for judging. Each submission should include four runs (RL1, RL2, RL3, RL4) and can have up to 2,000 documents ranked for each query (you may have fewer, but not more).  If you are returning zero documents for a query, instead return the single document "clueweb09-en0000-00-00000".

Submission format

Each submission includes four separate ranked result lists for all topics. Files should be named "runTag.RLn", where "runTag" is a unique identifier for your group and the particular submission, and "RLn" is RL1, RL2, RL3, or RL4, depending on the experimental condition. All four files must be present for a valid submission. Each file should be in standard NIST format: a single ASCII text file with white space separating columns.  The width of the columns is not important but you must have exactly six columns per line with at least one space between the columns.

The contents of the columns are:

  1. The first column is the topic number.
  2. The second column should always contain the string "Q0" (letter "Q" followed by number "0").
  3. The third column is the official document number of the retrieved document, found in the <DOCNO> field of the document.
  4. The fourth column is the rank of that document for that query.  Within a query, each of the numbers from 1 to 2000 should appear exactly once.  (If you retrieve fewer than 2000 documents, then you will have the numbers from 1 to that number.)
  5. The fifth column is the score your system generated to rank this document, either as an integer or a floating point number.  Scores must be in descending order.  Note that typical TREC evaluations use this column, not the rank column, to evaluate systems.  If you want the precise ranking you submit to be evaluated, the scores must reflect that ranking.
  6. The sixth column is your "run tag" and is a unique identifier for your group and the particular run.  Please change the tag from year to year, track to track, and run to run, so that different approaches can be compared.  Run tags may contain 12 or fewer letters and numbers with no punctuation (and no white space, or the line would have more than six columns). Each of the four files comprising a submission should have the same run tag.
A script for verifying that submissions are valid will be provided.
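In the meantime, the column rules above can be sketched as a simple checker. This is a hypothetical illustration, not the official validation script; the function name and error messages are made up for the example.

```python
# Hypothetical sketch of a submission-format check (the official validation
# script will be provided by the track; this only illustrates the rules above).
def check_run_file(lines, run_tag, max_rank=2000):
    """Check the six-column format: topic Q0 docno rank score tag."""
    last = {}  # topic -> (rank, score) of the previous line for that topic
    for n, line in enumerate(lines, 1):
        cols = line.split()
        if len(cols) != 6:
            raise ValueError(f"line {n}: expected 6 columns, got {len(cols)}")
        topic, q0, docno, rank, score, tag = cols
        if q0 != "Q0":
            raise ValueError(f"line {n}: second column must be 'Q0'")
        if not docno.startswith("clueweb09-"):
            raise ValueError(f"line {n}: bad document id {docno}")
        rank, score = int(rank), float(score)
        if not 1 <= rank <= max_rank:
            raise ValueError(f"line {n}: rank {rank} out of range")
        if not (tag.isalnum() and len(tag) <= 12) or tag != run_tag:
            raise ValueError(f"line {n}: bad run tag {tag}")
        if topic in last:
            prev_rank, prev_score = last[topic]
            if rank != prev_rank + 1 or score > prev_score:
                raise ValueError(f"line {n}: ranks must increase by 1 and "
                                 "scores must not increase within a topic")
        elif rank != 1:
            raise ValueError(f"line {n}: each topic must start at rank 1")
        last[topic] = (rank, score)

check_run_file(["1 Q0 clueweb09-en0010-21-23199 1 9876 mysys1",
                "1 Q0 clueweb09-en0481-51-62342 2 9875 mysys1"], "mysys1")
```

A checker like this would be run once per file (runTag.RL1 through runTag.RL4) with the same run tag.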

A submission consisting of four files (mysys1.RL1, mysys1.RL2, mysys1.RL3, mysys1.RL4) might look like this:

 $ cat mysys1.RL1 
 1 Q0 clueweb09-en0010-21-23199 1 9876 mysys1 
 1 Q0 clueweb09-en0481-51-62342 2 9875 mysys1 
 1 Q0 clueweb09-enwp04-22-09182 3 9874 mysys1 
 $ cat mysys1.RL2 
 1 Q0 clueweb09-en0010-21-23200 1 9963 mysys1 
 1 Q0 clueweb09-en0481-51-84332 2 9960 mysys1 
 1 Q0 clueweb09-enwp04-22-09992 3 9954 mysys1 
 ...
 $ cat mysys1.RL3 
 1 Q0 clueweb09-en0010-21-23199 1 9877 mysys1 
 1 Q0 clueweb09-en0481-51-62342 2 9875 mysys1 
 1 Q0 clueweb09-enwp04-22-09992 3 9870 mysys1 
 ...
 $ cat mysys1.RL4 
 1 Q0 clueweb09-en0010-21-23200 1 9999 mysys1 
 1 Q0 clueweb09-en0010-21-23199 2 9998 mysys1 
 ...

Your submission should be sorted by rank within topic number.  By implication it will also be sorted in descending order of score.

If you would normally return no documents for a query, instead return the single document "clueweb09-en0000-00-00000" at rank one.  Doing so maintains consistent evaluation results (averages over the same number of queries) and does not break anyone's tools.

Various rules

  1. A manual run is one in which a person is somehow involved in the process of converting a query into a ranked list, whether by formulating the query by hand, modifying the query by hand, classifying the query by hand, or adjusting the ranked list by hand.  
  2. If you used the provided topic descriptions in any way, say so when you submit your runs.