Track Guidelines

See the track's web site at http://ir.cis.udel.edu/sessions for up-to-date information.

Table of Contents

  1. Tasks
    1. Task 1
    2. Task 2
  2. Track data
    1. Corpus
    2. Topics
    3. Sessions
  3. Evaluation
    1. Relevance judgments and judging
    2. Evaluation measures
  4. Important dates
  5. How to participate
  6. Run submissions

Research in Information Retrieval has traditionally focused on serving the best results for a query, for varying definitions of "best": for ad hoc tasks, the most relevant results; for novelty and diversity tasks, the results that do the best job of covering a space of information needs; for known-item tasks, the result the user is looking for; for filtering tasks, the latest and most relevant results. But users often begin an interaction with a search engine with a sufficiently under-specified query that they will need to reformulate their query and/or information need several times before they find either the thing or every thing they are looking for. A search engine may be able to better serve a user not by ranking the most relevant results to each query in the sequence, but by ranking results that help "point the way" to what the user is really looking for, or by complementing results from previous queries in the sequence with new results, or in other currently-unanticipated ways.

We call a sequence of reformulations (along with any user interaction with the retrieval system) in service of satisfying an information need a session, and the goal of this track is: (G1) to test whether systems can improve their performance for a given query by using previous queries and user interactions with the retrieval system (including clicks on ranked results, dwell times, etc.), and (G2) to evaluate system performance over an entire query session instead of a single query.

Tasks

Task 1:

Task 1 will evaluate only (G1). Participants will be provided with a set of query sessions. Each session consists of the user's past queries in the session, the ranked results returned for each of those queries, the user's clicks on those results (with dwell times), and the current query for which results are to be returned (see the XML example under "Sessions" below).

Participants may take part in Task 1 in either of the following two ways:

  1. Submit the current query to your retrieval system and rank documents for the following three cases:
    • ignoring all information in the session log (RL1)
    • using only information from the prior history in the same session (RL2)
    • using any information available in the session log (RL3)
  2. Re-rank an RL1 baseline ranking of documents provided by the track coordinators (see the official run protocol below) for the following two cases:
    • using only information from the prior history in the same session (RL2)
    • using any information available in the session log (RL3)
You may submit runs under both options, but we will accept at most three submissions per participating group.

The official evaluation measure is the increase in nDCG@10 from RL1 to RL3 over the first 240 provided sessions. By comparing the retrieval effectiveness of RL1 with that of RL2 and RL3, we evaluate the extent to which a retrieval system can use information available prior to the current query, as well as information from other sessions, to improve the results for that query. Groups/submissions will be compared by how large an increase they achieve.
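
Concretely, assuming the per-session increases are simply averaged over the 240 judged sessions (an assumption; the exact aggregation is up to the track coordinators), the quantity being compared can be written as:

  \Delta\,\mathrm{nDCG@10} \;=\; \frac{1}{240} \sum_{s=1}^{240} \Bigl[ \mathrm{nDCG@10}\bigl(\mathrm{RL3}_s\bigr) - \mathrm{nDCG@10}\bigl(\mathrm{RL1}_s\bigr) \Bigr]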

Note that this is not an interactive track. Query sessions will be provided by NIST.

Task 2:

Task 2 is our pilot attempt to evaluate (G2). It will involve interaction with a simulated user. Participants will connect to a remote server, sending retrieval results and receiving user actions in an iterative process. Details have not been fixed yet; please stay tuned.

Task 2 will be open to participants for a limited window of time from late July into August. You will be able to complete your run at any time during that window. Completing a run should take less than a day (depending on the complexity of your system). Runs completed during this window will be judged by NIST assessors.

The simulation system will continue to be open after the window has closed, but there is no guarantee that runs completed after that will be judged (or that the system will use the same simulation models).

Track data

Corpus

The track will use the ClueWeb12 collection. The full collection consists of roughly 700 million English-language web pages. Participants are encouraged to use the entire collection; however, submissions over the smaller "Category B" collection of 50 million documents will be accepted. Note that Category B submissions will be evaluated as if they were Category A submissions.

Topics

For the 2014 track, we will re-use topics from previous Session tracks, but with new user sessions; you may use the 2012 and 2013 Session track sessions for these topics for training. Specifically, we will use the following topic numbers:

Sessions

Rather than distributing topics as most tracks do, the Session track distributes sessions of interactions of users in the process of satisfying some information need. As in previous years, sessions have been collected from actual user search activity for topics derived from previous years' TREC tracks. This year, users worked with a search system built on indri rather than the Yahoo! BOSS service used in previous years.

Session data is distributed in an XML file format. For example:


<session num="1" starttime="0">
   <topic>
       <interaction num="1" starttime="10.280644" >
          <query>wikipedia cosmetic laser treatment</query>
          <results>
             <result rank="1">
                <url>http://www.veindirectory.org/content/varicose_veins.asp</url>
                <title>Varicose Veins - Vein Treatment, Removal, Surgery Information</title>
                <snippet>... concern but can lead to more severe problems such as leg pain, leg 
                               swelling and leg cramps. View photos and find a varicose vein treatment 
                               center. ...</snippet>
             </result>
             <result rank="2">
                <url>http://www.peachcosmeticmedicine.com/treatments-Laser-and-IPL-hair-removal.html</url>
                <title>Laser and IPL hair removal - Treatments - Peach Cosmetic ...</title>
                <snippet>Laser hair removal served as Dr Mahony's introduction to cosmetic medicine 
                               back in 1999. ... Both our IPL and our laser offer skin chilling as part of
                               the treatment. ...</snippet>
             </result>

                    ...

             <result rank="10">
                <url>http://www.cosmeticsurgery10.com/index.html</url>
                <title>Cosmetic Surgery, Cosmetic Doctors, Cosmetic Physicians, and ...</title>
                <snippet>Cosmetic Surgery 10 is a resource that provides key information on cosmetic 
                               surgeries focusing on plastic surgeries, dermatology, cosmetic dentists and 
                               LASIK procedures.</snippet>
             </result>
          </results>
          <clicked>
             <click num="1" starttime="95.603468" endtime="120.565420">
                <rank>10</rank>
             </click>
             <click num="2" starttime="138.244467" endtime="181.841436>
                <rank>2</rank>
             </click>
          </clicked>
       </interaction>
       <interaction num="2">
                ...
       </interaction>

                ...

       <currentquery starttime="252.659006">
          <query>uses for cosmetic laser treatment</query>
       </currentquery>
   </topic>
</session>


When constructing your submissions, please follow these guidelines for using information in the XML file:

  1. For your RL1 submission(s), ignore all <interaction> blocks. Use only the <currentquery>, or the baseline results provided by the track coordinators.
  2. For your RL2 submission(s), you may use the <clicked>, <result>, <query>, and <currentquery> blocks, but only from the same session as the <currentquery>. (This is equivalent to an RL4 run from previous years.)
  3. For your RL3 submission(s), you may use the full session data (<clicked>, <result>, <query>, and <currentquery> blocks from any or all sessions) to try to improve results for each <currentquery>.
Please do not use any information in the <topic> field!
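
As an illustration, the following minimal sketch (unofficial; it assumes Python's standard xml.etree.ElementTree and uses a placeholder file name) extracts the evidence an RL2 run is allowed to use from each session:

# Unofficial sketch: collect RL2-usable evidence (current query, prior queries,
# clicked URLs) from a sessions file shaped like the example above.
# "sessions.xml" is a placeholder file name.
import xml.etree.ElementTree as ET

tree = ET.parse("sessions.xml")
for session in tree.getroot().iter("session"):
    topic = session.find("topic")
    current_query = topic.find("currentquery/query").text
    past_queries, clicked_urls = [], []
    for interaction in topic.findall("interaction"):
        past_queries.append(interaction.find("query").text)
        # Results are listed by rank; clicks refer back to those ranks.
        rank_to_url = {r.get("rank"): r.find("url").text
                       for r in interaction.iter("result")}
        for click in interaction.iter("click"):
            clicked_urls.append(rank_to_url.get(click.find("rank").text.strip()))
    # An RL2 run may combine current_query with past_queries and clicked_urls
    # from this session only; the <topic> description is never read.
    print(session.get("num"), current_query, past_queries, clicked_urls)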

Evaluation

Relevance judgments and judging

Judging will be done by assessors at NIST. Assessors will judge documents with respect to an information need provided by NIST. (Note: information needs will not be provided to participants.)

Documents in the depth-10 pool will be judged for relevance so that evaluation metrics such as PC(10) and nDCG(10) can be computed exactly. If further resources are available, additional judgments will be made on randomly selected documents further down the ranked lists (in the style of the Terabyte track). These metrics will be used for (G1), i.e. to evaluate the effectiveness of retrieval systems over the ranked lists RL1 -- RL3 and to compare their performance.
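
For intuition, the depth-10 pool for a query is simply the union of the top 10 documents across all submitted rankings for that query. A rough sketch (not the official assessment pipeline; run-file names are placeholders) is:

# Rough, unofficial sketch of a depth-10 judging pool: the union of the
# top-10 documents from every submitted run, collected per topic.
from collections import defaultdict

def depth10_pool(run_files, depth=10):
    pool = defaultdict(set)  # topic number -> set of docnos to judge
    for path in run_files:
        with open(path) as f:
            for line in f:
                if not line.strip():
                    continue
                topic, _q0, docno, rank, _score, _tag = line.split()
                if int(rank) <= depth:
                    pool[topic].add(docno)
    return pool

# e.g. pool = depth10_pool(["mysys1.RL1", "mysys1.RL2", "mysys1.RL3"])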

Evaluation measures

Task 1: The primary measure by which systems will be compared is the increase in nDCG@10 from RL1 to RL3 for the last query in the session, that is, the query in the <currentquery> field in the track data.
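
For reference, here is a minimal sketch of nDCG@10 and the per-session RL1-to-RL3 increase (unofficial; it assumes graded relevance gains used directly with the standard log2 discount, which may differ in detail from the official scoring code):

# Unofficial sketch of nDCG@10 and the RL1 -> RL3 increase for one query.
from math import log2

def dcg_at_k(gains, k=10):
    # DCG@k with the standard log2 discount: gain at position i divided by log2(i + 1).
    return sum(g / log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_10(ranked_docnos, qrels):
    """ranked_docnos: docnos in system order; qrels: docno -> graded relevance."""
    gains = [qrels.get(d, 0) for d in ranked_docnos]
    ideal_dcg = dcg_at_k(sorted(qrels.values(), reverse=True))
    return dcg_at_k(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Per-session quantity (hypothetical inputs):
# increase = ndcg_at_10(rl3_ranking, qrels) - ndcg_at_10(rl1_ranking, qrels)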

Task 2: For the pilot run of Task 2, we will use the session measures presented by Kanoulas et al. and Jarvelin et al.'s session-nDCG.

Important dates

June: Guidelines finalized.
mid-July: Sessions released.
September 10: Results submission deadline.
early October: Relevance judgments and individual evaluation scores due back to participants (estimated).
November 18-21: TREC 2014 conference at NIST in Gaithersburg, MD, USA (you must participate in some track to attend).

How to participate

To participate in the Session track, you must apply to participate in TREC with NIST. You must participate in at least one track if you wish to attend the TREC meeting in November.

Run submissions

Official run protocol

  1. Download the sessions from the TREC Active Participants page (password required).
  2. (Optional) Download the RL1 baseline rankings provided by the track coordinators.
  3. Run the last query (the one in the <currentquery> block) in each session once for each of the three conditions; put the results in the format described below.
  4. Upload the ranked lists to NIST (a link will be provided; you will need the TREC password).  
You may provide up to three submissions for judging. Each submission should include up to three runs (RL1, RL2, RL3); an RL1 run is required, while RL2 and RL3 are optional (see the rules below). Each submission can have up to 2,000 documents ranked for each query (you may have fewer, but not more). If you are returning zero documents for a query, instead return the single document "clueweb12-0000wb-00-00000".

Submission format

Each submission includes up to three separate ranked result lists for all topics. Files should be named "runTag.RLn", where "runTag" is a unique identifier for your group and the particular submission, and "RLn" is RL1, RL2, or RL3, depending on the experimental condition. An RL1 submission must be present for a valid submission. Each file should be in standard NIST format: a single ASCII text file with white space separating columns.  The width of the columns is not important but you must have exactly six columns per line with at least one space between the columns.

The contents of the columns are:

  1. The first column is the topic number.
  2. The second column should always contain the string "Q0" (letter "Q" followed by number "0").
  3. The third column is the official document number of the retrieved document, found in the <DOCNO> field of the document.
  4. The fourth column is the rank of that document for that query.  Within a query, each of the numbers from 1 to 2000 should appear exactly once.  (If you retrieve fewer than 2000 documents, then you will have the numbers from 1 to that number.)
  5. The fifth column is the score your system generated to rank this document, either as an integer or a floating point number.  Scores must be in descending order.  Note that typical TREC evaluations use this column, not the rank column, to evaluate systems.  If you want the precise ranking you submit to be evaluated, the scores must reflect that ranking.
  6. The sixth column is your "run tag" and is a unique identifier for your group and the particular run.  Please change the tag from year to year, track to track, and run to run, so that different approaches can be compared.  Run tags may contain 12 or fewer letters and numbers with no punctuation (and no white space, or the line would have more than six columns). Each of the three files comprising a submission should have the same run tag.
A script for verifying that submissions are valid will be provided.
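
Until the official script is released, the unofficial sketch below illustrates the main per-line constraints (six whitespace-separated columns, "Q0" in the second column, no repeated ranks within a topic, non-increasing scores); the real checker may enforce more:

# Unofficial, minimal format check for a runTag.RLn file.
import sys
from collections import defaultdict

def check_run(path, max_rank=2000):
    ranks_seen = defaultdict(set)   # topic -> ranks already used
    last_score = {}                 # topic -> previous score in the file
    with open(path) as f:
        for n, line in enumerate(f, 1):
            cols = line.split()
            if len(cols) != 6:
                return f"line {n}: expected 6 columns, got {len(cols)}"
            topic, q0, docno, rank, score, tag = cols
            if q0 != "Q0":
                return f"line {n}: second column must be Q0"
            rank, score = int(rank), float(score)
            if not 1 <= rank <= max_rank or rank in ranks_seen[topic]:
                return f"line {n}: bad or repeated rank {rank}"
            if topic in last_score and score > last_score[topic]:
                return f"line {n}: scores must not increase within a topic"
            ranks_seen[topic].add(rank)
            last_score[topic] = score
    return "ok"

if __name__ == "__main__":
    print(check_run(sys.argv[1]))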

A submission consisting of three files (mysys1.RL1, mysys1.RL2, mysys1.RL3) might look like this:

 
 $ cat mysys1.RL1 
 1 Q0 clueweb12-0010wb-21-23199 1 9876 mysys1 
 1 Q0 clueweb12-0481wb-51-62342 2 9875 mysys1 
 1 Q0 clueweb12-1004tw-22-09182 3 9874 mysys1 
 ... 
 
 $ cat mysys1.RL2
 1 Q0 clueweb12-0010wb-21-23200 1 9963 mysys1
 1 Q0 clueweb12-0481wb-51-84332 2 9960 mysys1
 1 Q0 clueweb12-1204tw-22-09992 3 9954 mysys1
 ...

 $ cat mysys1.RL3
 1 Q0 clueweb12-0010wb-21-23199 1 9877 mysys1
 1 Q0 clueweb12-0481wb-51-62342 2 9875 mysys1
 1 Q0 clueweb12-1004tw-22-09992 3 9870 mysys1
 ...

Your submission should be sorted by rank within topic number.  By implication it will also be sorted in descending order of score.

If you would normally return no documents for a query, instead return the single document "clueweb12-0000wb-00-00000" at rank one.  Doing so maintains consistent evaluation results (averages over the same number of queries) and does not break anyone's tools.

Various rules

  1. An RL1 submission is required of all participants. If you are re-ranking the provided documents, please submit that file as your RL1 run (be sure to change the "run tag" field appropriately). RL2 and RL3 runs are optional but highly recommended; your submission will only be evaluated against others if it includes an RL2 and/or RL3 run in addition to the RL1 run.
  2. A manual run is one in which a person is somehow involved in the process of converting a query into a ranked list, whether by formulating the query by hand, modifying the query by hand, classifying the query by hand, adjusting the ranked list by hand, or using data in the <topic> field.  
  3. If you used the provided topic descriptions in any way, say so when you submit your runs.