Track Guidelines

DRAFT: Much information is tentative.

See the track's web site for up-to-date information.

Table of Contents

  1. Task
  2. Background
  3. Important dates
  4. How to participate
  5. Track data
    1. Corpus
    2. Queries
  6. Run submissions
  7. Relevance judgments and judging
  8. Selection of queries and documents
    1. Query selection
    2. Document selection
  9. Evaluation criteria
The Million Query Track serves two purposes.  First, it is an exploration of ad-hoc retrieval on a large collection of documents.  Second, it investigates questions of system evaluation, particularly whether it is better to evaluate using many shallow judgments or fewer thorough judgments and whether small sets of judgments are reusable.

Participants in this track will run up to 40,000 queries, at least once each, against a large collection of web documents. These queries will be classified by assessors as "precision-oriented" or "recall-oriented". Participants can, if so motivated, try to determine the query class and choose ranking algorithms specialized for each class.

This is the third year that the Million Query track is running. Results from the previous two years have established that retrieval systems can be as accurately evaluated with many shallowly judged topics as with few deeply judged topics. This year our primary goals are:


Task

The task is a standard ad hoc retrieval task, with the added feature that queries will be classified by hardness and by whether the user's intent is to find as much information about the topic as possible ("recall-oriented") or to find one or a few highly relevant documents about one particular aspect ("precision-oriented").

Classes will be assigned by assessors according to a categorization of user search goals based on work by Rose and Levinson. Assessors will choose one of the six categories to indicate the type of information need, and we will map the categories to two broader classes as follows:

Query hardness will be established post hoc by thresholding the average of estimated average precisions for each topic.
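As a sketch of this post hoc classification (the threshold value and the "hard"/"easy" labels below are illustrative assumptions, not track specifications):

```python
# Classify topic hardness by thresholding the mean of the available AP
# estimates for each topic. THRESHOLD is a made-up illustrative value.

THRESHOLD = 0.10  # hypothetical cutoff; the track has not specified one

def classify_hardness(ap_estimates, threshold=THRESHOLD):
    """ap_estimates maps topic id -> list of estimated APs for that topic."""
    classes = {}
    for topic, estimates in ap_estimates.items():
        mean_ap = sum(estimates) / len(estimates)
        classes[topic] = "hard" if mean_ap < threshold else "easy"
    return classes
```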

Participants may attempt to use query class prediction to choose a retrieval approach. Class predictions may be submitted as well, and in that case prediction success will be part of the track evaluation. Each topic will be evaluated by standard AP estimates (ignoring class predictions), as well as by precision-based measures or recall-based measures depending on class.

As an example of the classification task, the query "district of columbia department of motor vehicles" (from the 2008 1MQ track) might be classified as "precision-oriented", as its intent seems navigational in nature. The query "vietnam vets and agent orange" might be classified as "recall-oriented", as it seems to reflect a broad, open-ended informational intent.

Additionally, participants can choose whether to run 1,000, 4,000, 10,000, or all 40,000 queries. Evaluation results will be compared across these completed query set sizes.

Track background

You may find the following background information useful:

Important dates

Dates have not been finalized, but these should provide some tentative sense of the timeline.

How to participate

To participate in the Million Query track, you must
apply to participate in TREC with NIST.  (The official deadline is in mid February, but late applications are sometimes accepted.)  Note that applying is not an onerous process and it is unusual not to be accepted.  However, you must apply if you wish to participate, and you must participate (in some track) if you wish to attend the TREC meeting in November.

Participants in this track will be expected to run a large number of queries against a collection of documents and to return a ranked list of documents for judging. Participants may also submit class predictions, and may also be allowed to submit retrieval results for each class for each query (up to four result sets per query if using different algorithms for different classes).

Track data


Corpus

This track will use the new ClueWeb09 collection, a corpus of web pages crawled in early 2009 comprising 1 billion documents in 25 terabytes. The 1MQ track will use the smaller "Category B" set of 50 million documents in 1 terabyte.

The collection is available from Carnegie Mellon University, distributed on a hard disk that will be shipped to you (you get to keep the disk).  The Category B set can be obtained for US$240 (to cover the costs of the drives and preparing the data) plus shipping. Full details on how to acquire the ClueWeb collection are available here.


Queries

Topics for this task will be drawn from a large collection of queries collected by a large Internet search engine. Queries will be title-length (in TREC parlance). During judging, queries will be developed into full-blown TREC topics.

The track will include forty thousand (40,000) queries, possibly including some that overlap with those used by the Relevance Feedback and/or Web tracks this year. The queries will be randomly selected from a query log, and no quality control will be imposed on the 40,000 selected queries. We hope that most of them are good-quality queries, but some are likely to be partially or entirely non-English, some may contain spelling errors, and some may be incomprehensible to anyone other than the person who originally created them.

Query Prioritization. The queries will be prioritized, and every participant must run the first 1,000; additional queries are optional. For effectiveness evaluation, participants should complete the first 1,000, the first 4,000, the first 10,000, or all 40,000. Totals in between will be reduced to the nearest lower group (e.g., a site that completes 2,500 queries will be treated as having completed only the first 1,000 for purposes of effectiveness evaluation, though all 2,500 will count toward query throughput evaluation).
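The reduction rule can be sketched as follows (the function name and error handling are illustrative):

```python
# Reduce a completed-query total to the nearest lower official group for
# effectiveness evaluation; the raw total still counts for throughput.

BUCKETS = (1000, 4000, 10000, 40000)

def effectiveness_bucket(completed):
    eligible = [b for b in BUCKETS if b <= completed]
    if not eligible:
        raise ValueError("fewer than the required 1,000 queries completed")
    return max(eligible)
```

For example, a site completing 2,500 queries falls into the 1,000-query group.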

Queries will be distributed in a text file where each line has the format "N:P:query word or words".  Here, N is the query number (followed by a colon delimiter), P is the priority (a number 1-4, with 1 indicating highest priority), followed by the query itself.  For example, the line (from the TREC 2007 training file) "32:1:barack obama internships" means that query number 32 is the 3-word query "barack obama internships" and is a high-priority (i.e. required) query.  All queries are in lowercase.  Some queries contain punctuation, but there is no guarantee that the punctuation is intended to provide any sort of "query language": in particular, most single quotation marks are apostrophes, not quotation marks. If any queries contain a colon, a different delimiting character will be chosen and these guidelines will be updated accordingly.
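A minimal parser for this distribution format might look like the following; it splits on the first two colons only, so any later punctuation in the query survives intact:

```python
# Parse one line of the "N:P:query word or words" query file.

def parse_query_line(line):
    number, priority, query = line.rstrip("\n").split(":", 2)
    return int(number), int(priority), query

# The training-file example from above:
num, priority, query = parse_query_line("32:1:barack obama internships")
```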

Run submissions

Various rules

  1. A manual run is one in which a person is somehow involved in the process of converting a query into a ranked list, whether by formulating the query by hand, modifying the query by hand, classifying the query by hand, or adjusting the ranked list by hand.   With 40,000 queries, it is unlikely that there will be many manual runs.
  2. A system may not be modified in light of the set of queries sent.  You should not look at the evaluation queries before you are ready to run your system.

Official run protocol

  1. Download the 40,000 queries from the TREC active participants web site.
  2. Run those queries to generate up to 40,000 ranked lists and put them in NIST standard format.  You can verify the format of your data using a script provided by NIST.  (Use the script: NIST will reject a submission that doesn't pass the script, and you'll end up doing the upload again.)
  3. Upload the ranked lists to NIST (a link will be provided; you will need the TREC password).  
You may submit up to five runs for judging. Each run should have 1,000 documents ranked for each query, though it may contain fewer.  If you are returning zero documents for a query, instead return the single document "clueweb09-en0000-00-00000".

Submission format

The standard NIST format for submissions is a single ASCII text file.  White space is used to separate columns.  The width of the columns is not important, but you must have exactly six columns per line with at least one space between the columns.

The contents of the columns are:

  1. The first column is the topic number.
  2. The second column may be unused, or it may be used for query class predictions. If unused, it should always contain the string "Q0" (letter "Q" followed by number "0"). If used for query class prediction, it should contain one of the following character strings:
  3. The third column is the official document number of the retrieved document, found in the <DOCNO> field of the document.
  4. The fourth column is the rank of that document for that query.  Within a query, each of the numbers from 1 to 1000 should appear exactly once.  (If you retrieve fewer than 1000 documents, then you will have the numbers from 1 to that number.)
  5. The fifth column is the score your system generated to rank this document, either as an integer or a floating point number.  Scores must be in descending order.  Note that typical TREC evaluations use this column, not the rank column, to evaluate systems.  If you want the precise ranking you submit to be evaluated, the scores must reflect that ranking.
  6. The sixth column is your "run tag" and is a unique identifier for your group and the particular run.  Please change the tag from year to year, track to track, and run to run, so that different approaches can be compared.  Run tags may contain 12 or fewer letters and numbers with no punctuation (and no white space, or the line would have more than six columns).

For example, a run that is not attempting the classification task might look like this:

100 Q0 clueweb09-en0010-21-23199 1 9876 mysys1
100 Q0 clueweb09-en0481-51-62342 2 9875 mysys1
100 Q0 clueweb09-enwp04-22-09182 3 9874 mysys1

A run that has attempted both classification tasks might look like this:

100 PE clueweb09-en0010-21-23199 1 9876 mysys1
100 PE clueweb09-en0481-51-62342 2 9875 mysys1
100 PE clueweb09-enwp04-22-09182 3 9874 mysys1

Your submission should be sorted by rank within topic number.  By implication it will also be sorted in descending order of score.

If you would normally return no documents for a query, instead return the single document "clueweb09-en0000-00-00000" at rank one.  Doing so maintains consistent evaluation results (averages over the same number of queries) and does not break anyone's tools.

For the Million Query Track, the submission file will be quite large (between 1,000,000 and 40,000,000 lines of text).  All submission files must therefore be compressed before being uploaded.  Use either gzip or bzip2 for the compression.

Plan on your uploads taking several hours.  Some tips from experienced submitters:

  1. Run the format-checking script before you try to upload a large file.  NIST will reject your submission if it does not pass the checking program, so you should try it yourself first.
  2. Make sure you fill out the submission page information completely and in the correct form.  NIST will upload your run before validating the information in the form.  If you made a mistake, you will have to upload the data again.
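This is not the official NIST script, but a rough pre-check along the following lines can catch the most common problems before an hours-long upload:

```python
import re

TAG_RE = re.compile(r"^[A-Za-z0-9]{1,12}$")  # run tag: 12 or fewer letters/digits

def check_run_lines(lines):
    """Rough format check: six columns, ranks 1..n in order within each
    topic, non-increasing scores, and a legal run tag."""
    last = {}  # topic -> (last rank seen, last score seen)
    for line in lines:
        cols = line.split()
        if len(cols) != 6:
            return False
        topic, _cls, _docno, rank, score, tag = cols
        if not TAG_RE.match(tag):
            return False
        rank, score = int(rank), float(score)
        prev_rank, prev_score = last.get(topic, (0, float("inf")))
        if rank != prev_rank + 1 or score > prev_score:
            return False
        last[topic] = (rank, score)
    return True
```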

Relevance judgments and judging

Judging will be done by assessors at NIST.  Participants are welcome (encouraged!) to provide judgments, too, though the track cannot count on that.

The process will look roughly like this from the perspective of someone judging:
  1. The assessment system will present 10 queries or query "sessions" (one or more queries grouped together) randomly selected from the evaluation set.
  2. The assessor will select one of those ten queries or sets of queries to judge.  The others will be returned to the pool.
  3. The assessor will provide the description and narrative parts of the query, creating a full TREC topic.  This information will be used by the assessor to keep focus on what is relevant. The assessor will also provide a class (selected from a drop-down menu) such as informational or navigational that we will map to "precision-oriented" or "recall-oriented".
  4. The system will present a web page and ask whether it is relevant to the query.  Judgments will be on the same four-way scale used in the 2008 track: highly relevant, relevant, reasonable but not relevant, or not relevant.  Consistent with past practice, the distinction between the first two and the last two will be up to the assessor.
  5. After judgments of "relevant" or "highly relevant", the assessor may be asked if they are satisfied with the relevant material they have seen so far. This may provide more information for predicting ease/difficulty and precision/recall.
  6. Documents will be presented until reaching some pre-determined stopping condition as discussed below.
The system for carrying out these judgments will be the same system that was used for the 2007 and 2008 tracks.

Selection of queries and documents for judging

With 40,000 queries and limited assessor budget, the queries and documents selected for judging must be carefully chosen to ensure high power in statistical tests.

Query selection

The 10 queries shown to assessors will be selected to ensure that the highest-priority 1000 queries receive more total judging effort than the next 3000, which in turn receive more total judging effort than the last 6000.

In addition, some queries will be designated for studying reusability. For these queries, one or more sites' runs will be selected at random to be held out of the judging process, so that those runs will not contribute to the judgments. This will make it possible to study the power of conclusions drawn from an evaluation with a small set of judgments.

Our goal is to have n baseline queries for which all sites contributed judgments (with n large enough to ensure robust baseline conclusions about systems), and another m queries for which k sites' runs were held out (with m and k chosen to ensure that each site is treated equally, and the resulting number of held-out queries large enough that comparisons between systems on those queries alone are robust). This will ensure that the following system comparisons are possible (using e.g. paired t-tests or ANOVA):

  1. System-to-system on baseline queries (baseline evaluation within-site and between-site)
  2. Within-site system-to-system on queries the site was held out of (within-site reusability)
  3. Between-site system-to-system on queries that both sites were held out of (between-site reusability)
  4. Between-site system-to-system on queries that site A was held out of and site B was not held out of (reusability of judgments for comparing to original TREC results)
Additional types of comparisons will be possible using appropriate tools; more details of our planned design and statistical analyses can be found
here (not ready yet).
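As an illustration of the within-site comparison in item 2, a paired t statistic over per-query AP estimates can be computed directly (the per-query scores below are made up):

```python
import math
import statistics

def paired_t(scores_a, scores_b):
    """t statistic for a paired t-test over per-query scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean = statistics.mean(diffs)
    stdev = statistics.stdev(diffs)  # sample standard deviation
    return mean / (stdev / math.sqrt(len(diffs)))

# Hypothetical per-query AP estimates for two systems on held-out queries:
system_a = [0.12, 0.30, 0.25, 0.08, 0.41]
system_b = [0.10, 0.28, 0.30, 0.05, 0.38]
t = paired_t(system_a, system_b)
```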

Document selection

Two approaches to selecting documents will be used in TREC 2009:
  1. "Minimal Test Collections" (MTC).  In this method, documents are selected by how much they inform us about the difference in mean average precision given all the judgments that were made up to that point.  Because average precision is quadratic in relevance judgments, the amount each relevant document contributes is a function of the total number of judgments made and the ranks they appear at. Nonrelevant documents also contribute to our knowledge:  if a document is nonrelevant, it tells us that certain terms cannot contribute anything to average precision.  We quantify how much a document will contribute if it turns out to be relevant or nonrelevant, then select the one that we expect to contribute the most. A SIGIR 2006 paper by Carterette, Allan, and Sitaraman describing this approach is available in the ACM Digital Library or here.
  2. Statistical sampling.  This method draws and judges a specific random sample of documents from the given ranked lists and produces unbiased, low-variance estimates of average precision, R-precision, and precision at standard cutoffs from these judged documents.  Additional (non-random) judged documents may also be included in the estimation process, further improving the quality of the estimates.  For more details, here is a working draft of a paper by Aslam, Pavlu, and Yilmaz, describing this approach (note that it is a working draft, so may change periodically).  
More details about the implementation and application of these methods can be found in previous years' track reports (1MQ 2007, 1MQ 2008).
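The sampling idea in method 2 can be illustrated with a deliberately simplified uniform-sample estimator (the actual method uses non-uniform sampling and estimates average precision, not just a relevant-document count; the function below is a toy sketch):

```python
import random

def estimate_relevant_in_top_k(ranked_docs, k, sample_size, judge, rng=random):
    """Judge a uniform random sample of the top k documents and scale up
    the count of relevant judged documents to estimate the total number
    of relevant documents in the top k. Each document is sampled with
    equal probability, so the scaled count is an unbiased estimate."""
    pool = ranked_docs[:k]
    sample = rng.sample(pool, min(sample_size, len(pool)))
    relevant = sum(1 for doc in sample if judge(doc))
    return relevant * len(pool) / len(sample)
```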

Each query will be judged by alternating between these two methods. Neither method will have knowledge of the other's judgments. Documents selected by both methods will only be shown to the assessor once.

Each query will receive 16, 32, 64, or 128 judgments. This number will be determined after topic definition and will be selected to satisfy the following constraints:

Evaluation criteria

There will be two separate evaluations: a standard ad hoc evaluation by estimates of average precision, and a precision/recall-oriented evaluation by precision or recall measures.

For the first, each topic will be evaluated by two estimates of average precision based on the two document selection algorithms described above (independent of classification). The final track report from TREC 2007, a related SIGIR 2008 paper, and a recent ECIR 2009 paper present analysis of the measures, their accuracy, and their differences.

For the second, each query will be evaluated by a precision measure and a recall measure depending on its assessed classification.