Track Guidelines

See the track's web site for up-to-date information.

Table of Contents

  1. Tasks
    1. Task 1
    2. Task 2
  2. Track data
    1. Corpus
    2. Sessions
  3. Evaluation
    1. Relevance judgments and judging
    2. Evaluation measures
  4. Important dates
  5. How to participate
  6. Run submissions

Research in Information Retrieval has traditionally focused on serving the best results for a query, for varying definitions of "best": for ad hoc tasks, the most relevant results; for novelty and diversity tasks, the results that do the best job of covering a space of information needs; for known-item tasks, the result the user is looking for; for filtering tasks, the latest and most relevant results. But users often begin an interaction with a search engine with a sufficiently under-specified query that they will need to reformulate their query and/or information need several times before they find either the thing or every thing they are looking for. A search engine may be able to better serve a user not by ranking the most relevant results to each query in the sequence, but by ranking results that help "point the way" to what the user is really looking for, or by complementing results from previous queries in the sequence with new results, or in other currently-unanticipated ways.

We call a sequence of reformulations (along with any user interaction with the retrieval system) in service of satisfying an information need a session, and the goal of this track is: (G1) to test whether systems can improve their performance for a given query by using previous queries and user interactions with the retrieval system (including clicks on ranked results, dwell times, etc.), and (G2) to evaluate system performance over an entire query session instead of a single query.


Tasks

Task 1

Task 1 will only evaluate (G1). Participants will be provided with a set of query sessions. Each session will consist of the current query, the queries that preceded it in the session, the ranked lists of results returned for those earlier queries, and the results the user clicked on (with click start and end times).

Participants will then run their retrieval system over the current query under up to three conditions: (RL1) using the current query only; (RL2) additionally using the prior queries, results, and clicks from the same session; and (RL3) additionally using data from all other sessions.

Comparing the retrieval effectiveness of (RL1) with that of (RL2) and (RL3), we can evaluate the extent to which a retrieval system can use information from earlier in a session, and from other sessions, to improve the results for the current query.

Note that this is not an interactive track. Query sessions will be provided by NIST.

Task 2

Task 2 is our pilot attempt to evaluate (G2). It will involve interaction with a simulated user. Participants will connect to a remote server, sending retrieval results and receiving user actions in an iterative process. Details have not been fixed yet; please stay tuned.

Task 2 will be open to participants for a limited window of time in late July continuing into August. You will be able to complete your run any time during that window. Completing a run should take less than a day (depending on the complexity of your system). Runs completed during this time will be judged by NIST assessors.

The simulation system will continue to be open after the window has closed, but there is no guarantee that runs completed after that will be judged (or that the system will use the same simulation models).

Track data


Corpus

The track will use the ClueWeb12 collection. The full collection consists of roughly 700 million English-language web pages. Participants are encouraged to use the entire collection; however, submissions over the smaller "Category B" collection of 50 million documents will also be accepted. Note that Category B submissions will be evaluated as if they were Category A submissions.


Sessions

Rather than distributing topics as most tracks do, the Session track distributes sessions of interactions of users in the process of satisfying some information need. As in previous years, sessions have been collected from actual user search activity for topics derived from previous years' TREC tracks. This year, users worked with a search system built on Indri rather than on Yahoo! BOSS.

Session data is distributed in an XML file format. For example:

<session num="1" starttime="0">
   <interaction num="1" starttime="10.280644">
      <query>wikipedia cosmetic laser treatment</query>
      <result rank="1">
         <title>Varicose Veins - Vein Treatment, Removal, Surgery Information</title>
         <snippet>... concern but can lead to more severe problems such as leg pain, leg
            swelling and leg cramps. View photos and find a varicose vein treatment
            center. ...</snippet>
      </result>
      <result rank="2">
         <title>Laser and IPL hair removal - Treatments - Peach Cosmetic ...</title>
         <snippet>Laser hair removal served as Dr Mahony's introduction to cosmetic medicine
            back in 1999. ... Both our IPL and our laser offer skin chilling as part of
            the treatment. ...</snippet>
      </result>
      ...
      <result rank="10">
         <title>Cosmetic Surgery, Cosmetic Doctors, Cosmetic Physicians, and ...</title>
         <snippet>Cosmetic Surgery 10 is a resource that provides key information on cosmetic
            surgeries focusing on plastic surgeries, dermatology, cosmetic dentists and
            LASIK procedures.</snippet>
      </result>
      <clicked>
         <click num="1" starttime="95.603468" endtime="120.565420" />
         <click num="2" starttime="138.244467" endtime="181.841436" />
      </clicked>
   </interaction>
   <interaction num="2">
      ...
   </interaction>
   <currentquery starttime="252.659006">
      <query>uses for cosmetic laser treatment</query>
   </currentquery>
</session>

(Note: for 2013, starttime attributes are given relative to the session start time.)

Each submission should give ranked results for the <currentquery> for each of the first 87 sessions (numbered consecutively 1-87). Sessions numbered 88-133 consist of only one query; they can be used to fit models or for any other purpose.

When constructing your submissions, please follow these guidelines for using information in the XML file:

  1. For your RL1 submission(s), ignore all <interaction> blocks. Use only the <currentquery>.
  2. For your RL2 submission(s), you may use the <clicked>, <result>, <query>, and <currentquery> blocks, but only from the same session as the <currentquery>. (This is equivalent to an RL4 run from previous years.)
  3. For your RL3 submission(s), you may use the full session data (<clicked>, <result>, <query>, and <currentquery> blocks from any or all sessions) to try to improve results for each <currentquery>. Only the RL3 run should use information from sessions 88-133.
Please do not use any information in the <topic> field!
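The session XML can be processed with standard tools. The sketch below is illustrative, not an official script: it parses a single abridged (and hypothetical) session in the format shown above, extracting the current query, the prior queries, and the dwell time of each click (endtime minus starttime, both relative to the session start in 2013).

```python
# Illustrative sketch: parsing a session in the XML format described above
# using only Python's standard library. The inline sample is abridged.
import xml.etree.ElementTree as ET

SAMPLE = """\
<session num="1" starttime="0">
  <interaction num="1" starttime="10.280644">
    <query>wikipedia cosmetic laser treatment</query>
    <result rank="1">
      <title>Varicose Veins - Vein Treatment, Removal, Surgery Information</title>
      <snippet>View photos and find a varicose vein treatment center.</snippet>
    </result>
    <clicked>
      <click num="1" starttime="95.603468" endtime="120.565420" />
    </clicked>
  </interaction>
  <currentquery starttime="252.659006">
    <query>uses for cosmetic laser treatment</query>
  </currentquery>
</session>"""

def parse_session(xml_text):
    """Return (current_query, prior_queries, click_dwell_times)."""
    session = ET.fromstring(xml_text)
    current = session.find("currentquery/query").text
    priors = [i.find("query").text for i in session.findall("interaction")]
    # Dwell time = endtime - starttime for each click.
    dwells = [float(c.get("endtime")) - float(c.get("starttime"))
              for c in session.iter("click")]
    return current, priors, dwells

current, priors, dwells = parse_session(SAMPLE)
print(current)  # uses for cosmetic laser treatment
print(priors)   # ['wikipedia cosmetic laser treatment']
```

An RL1 run would use only `current`; RL2 runs may additionally draw on `priors` and `dwells` from the same session.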

Evaluation

Relevance judgments and judging

Judging will be done by assessors at NIST. Assessors will judge documents with respect to an information need provided by NIST. (Note: information needs will not be provided to participants.)

Documents in the depth-10 pool will be judged for relevance so that evaluation metrics such as PC(10) and nDCG(10) can be computed exactly. If further resources are available, additional judgments will be made on randomly selected documents further down the ranked lists (in the style of the Terabyte track). These evaluation metrics will be used for (G1), i.e. to evaluate the effectiveness of retrieval systems over the ranked lists RL1--RL3 and compare their performance.

Evaluation measures

Task 1: The primary measure by which systems will be compared is nDCG@10 on the current query in the session. Thus you can do well on the track simply by having a good ad hoc system, but you can potentially do better if you are able to make use of the session prior to the current query.
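For reference, nDCG@10 can be sketched as below. This follows a common TREC convention (graded gain 2^rel - 1 with a log2(rank + 1) discount); the exact gain mapping used by the official evaluation script may differ, and the relevance grades in the example are hypothetical.

```python
# Sketch of nDCG@10 under a common TREC convention; not the official script.
import math

def dcg(rels):
    # Graded gain (2^rel - 1), discounted by log2(rank + 1); rank starts at 1.
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_10(ranked_rels, all_judged_rels):
    # Ideal DCG: the 10 highest judged grades for this query, best first.
    ideal = sorted(all_judged_rels, reverse=True)[:10]
    denom = dcg(ideal)
    return dcg(ranked_rels[:10]) / denom if denom > 0 else 0.0

# Hypothetical grades for one ranked list, and all judged grades for the query:
score = ndcg_at_10([2, 0, 1, 0, 0, 0, 0, 0, 0, 0], [2, 2, 1, 1, 0, 0])
print(round(score, 4))  # 0.601
```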

Task 2: For the pilot run of Task 2, we will use the session measures presented by Kanoulas et al. and Järvelin et al.'s session nDCG.

Important dates

  Early June: Guidelines finalized; sample sessions released.
  Mid June: Sessions released.
  September 10 (extended): Results submission deadline.
  Early October: Relevance judgments and individual evaluation scores returned to participants (estimated).
  November 19-22: TREC 2013 conference at NIST in Gaithersburg, MD, USA (you must participate in some track to attend).

How to participate

To participate in the Session track, you must
apply to participate in TREC with NIST. You must participate in at least one track if you wish to attend the TREC meeting in November.

Run submissions

Official run protocol

  1. Download the topics from the TREC Active Participants page (password required).
  2. Run the last query in each topic once for each of the three conditions; put the results in the format described below.
  3. Upload the ranked lists to NIST (a link will be provided; you will need the TREC password).  
You may provide up to three submissions for judging. Each submission should include up to three runs (RL1, RL2, RL3); you need not provide a run for every condition. Each run may rank up to 2,000 documents per query (you may have fewer, but not more). If you would return zero documents for a query, instead return the single document "clueweb12-0000wb-00-00000".

Submission format

Each submission includes three separate ranked result lists for all topics. Files should be named "runTag.RLn", where "runTag" is a unique identifier for your group and the particular submission, and "RLn" is RL1, RL2, or RL3, depending on the experimental condition. All three files must be present for a valid submission. Each file should be in standard NIST format: a single ASCII text file with white space separating columns.  The width of the columns is not important but you must have exactly six columns per line with at least one space between the columns.

The contents of the columns are:

  1. The first column is the topic number.
  2. The second column should always contain the string "Q0" (letter "Q" followed by number "0").
  3. The third column is the official document number of the retrieved document, found in the <DOCNO> field of the document.
  4. The fourth column is the rank of that document for that query.  Within a query, each of the numbers from 1 to 2000 should appear exactly once.  (If you retrieve fewer than 2000 documents, then you will have the numbers from 1 to that number.)
  5. The fifth column is the score your system generated to rank this document, either as an integer or a floating point number.  Scores must be in descending order.  Note that typical TREC evaluations use this column, not the rank column, to evaluate systems.  If you want the precise ranking you submit to be evaluated, the scores must reflect that ranking.
  6. The sixth column is your "run tag" and is a unique identifier for your group and the particular run.  Please change the tag from year to year, track to track, and run to run, so that different approaches can be compared.  Run tags may contain 12 or fewer letters and numbers with no punctuation (and no white space, or the line would have more than six columns). Each of the three files comprising a submission should have the same run tag.
A script for verifying that submissions are valid will be provided.
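The official verification script will come from NIST; as a rough, unofficial illustration, the constraints above (six columns, "Q0", ranks 1..n, descending scores, an alphanumeric run tag of at most 12 characters, at most 2,000 documents per query) could be checked like this:

```python
# Illustrative (unofficial) check of the submission-format constraints.
def check_run_lines(lines):
    by_topic = {}
    for line in lines:
        cols = line.split()
        assert len(cols) == 6, "exactly six whitespace-separated columns"
        topic, q0, docno, rank, score, tag = cols
        assert q0 == "Q0", 'second column must be "Q0"'
        assert docno.startswith("clueweb12-"), "ClueWeb12 document number"
        assert tag.isalnum() and len(tag) <= 12, "run tag: <=12 alphanumerics"
        by_topic.setdefault(topic, []).append((int(rank), float(score)))
    for topic, rows in by_topic.items():
        ranks = [r for r, _ in rows]
        scores = [s for _, s in rows]
        assert ranks == list(range(1, len(rows) + 1)), "ranks 1..n in order"
        assert scores == sorted(scores, reverse=True), "scores descending"
        assert len(rows) <= 2000, "at most 2000 documents per query"
    return True

ok = check_run_lines([
    "1 Q0 clueweb12-0010wb-21-23199 1 9876 mysys1",
    "1 Q0 clueweb12-0481wb-51-62342 2 9875 mysys1",
])
```

Run the check once per file (runTag.RL1, runTag.RL2, runTag.RL3) before uploading.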

A submission consisting of three files (mysys1.RL1, mysys1.RL2, mysys1.RL3) might look like this:

 $ cat mysys1.RL1
 1 Q0 clueweb12-0010wb-21-23199 1 9876 mysys1
 1 Q0 clueweb12-0481wb-51-62342 2 9875 mysys1
 1 Q0 clueweb12-1004tw-22-09182 3 9874 mysys1
 ...
 $ cat mysys1.RL2
 1 Q0 clueweb12-0010wb-21-23200 1 9963 mysys1
 1 Q0 clueweb12-0481wb-51-84332 2 9960 mysys1
 1 Q0 clueweb12-1204tw-22-09992 3 9954 mysys1
 ...
 $ cat mysys1.RL3
 1 Q0 clueweb12-0010wb-21-23199 1 9877 mysys1
 1 Q0 clueweb12-0481wb-51-62342 2 9875 mysys1
 1 Q0 clueweb12-1004tw-22-09992 3 9870 mysys1
 ...

Your submission should be sorted by rank within topic number.  By implication it will also be sorted in descending order of score.

If you would normally return no documents for a query, instead return the single document "clueweb12-0000wb-00-00000" at rank one.  Doing so maintains consistent evaluation results (averages over the same number of queries) and does not break anyone's tools.

Various rules

  1. A manual run is one in which a person is somehow involved in the process of converting a query into a ranked list, whether by formulating the query by hand, modifying the query by hand, classifying the query by hand, or adjusting the ranked list by hand.  
  2. If you used the provided topic descriptions in any way, say so when you submit your runs.