Creating a Test Collection:
Relevance Judgements of Cited & Non-cited Papers

Anna Ritchie
University of Cambridge, Computer Laboratory
15 J J Thomson Avenue, Cambridge, CB3 0FD, U.K.

Stephen Robertson
Microsoft Research Ltd, Roger Needham House
7 J J Thomson Avenue, Cambridge, CB3 0FB, U.K.

Simone Teufel
University of Cambridge, Computer Laboratory
15 J J Thomson Avenue, Cambridge, CB3 0FD, U.K.

Abstract

We investigate the effect of different sources of relevant documents in the creation of a test collection in the scientific domain.
Based on the Cranﬁeld 2 design, paper authors are asked to judge their cited papers for relevance in the ﬁrst stage. In a second stage, documents outside the reference list are judged. In this paper, we use the test collection with standard IR engines to compare the information contained in the judgements of the ﬁrst vs second stage. Using different correlation studies, we found that the judgements of the cited papers do not predict those from the non-cited papers, which means that the combination of sources results in a higher quality collection.
1 Introduction

Building a test collection is a long and expensive process but is sometimes necessary when no ready-made collection with the right properties exists. We aim to improve term-based IR on scientific papers with citation information, by using terms from the citing document to additionally describe (i.e., index) the cited document. We needed a test collection with full text for many citing and cited documents. A high proportion of citations from documents in the collection to other collection documents will be most useful; we built our test collection around the ACL Anthology (http://www.aclweb.org/anthology/), since we empirically found Computational Linguistics to be a relatively self-contained field.
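To make the indexing idea concrete, the following is a minimal sketch, assuming invented document identifiers and a simple bag-of-words representation; the paper does not prescribe any particular implementation. Each cited paper is indexed by its own terms plus the terms that citing documents use when referring to it.

```python
from collections import Counter, defaultdict

def build_index(documents, citations):
    """Index each document by its own terms plus terms from contexts that cite it.

    documents: {doc_id: full text of the document}
    citations: {citing_id: [(cited_id, citation context text), ...]}
    Both structures are illustrative assumptions, not the paper's actual format.
    """
    index = defaultdict(Counter)
    for doc_id, text in documents.items():
        index[doc_id].update(text.lower().split())
    for citing_id, refs in citations.items():
        for cited_id, context in refs:
            # Terms from the citing document additionally describe the cited one.
            index[cited_id].update(context.lower().split())
    return index

# Toy example with invented Anthology-style identifiers
docs = {"P05-1001": "statistical parsing with lexicalised grammars",
        "P05-1002": "we extend a lexicalised parser with discriminative reranking"}
cites = {"P05-1002": [("P05-1001", "the state-of-the-art lexicalised statistical parser")]}
print(build_index(docs, cites)["P05-1001"].most_common(3))
```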
The idea of using terms external to a document for indexing, coming from a ‘citing’ document, is also used in web IR. Citations are quite like hyperlinks and link structure, particularly anchor text, has been used to advantage in retrieval tasks (McBryan, 1994; Hawking and Craswell, 2005).
While web pages are often poorly self-descriptive (Brin and Page, 1998), anchor text is often a higher-level description of the pointed-to page (Davison, 2000). Some work has been done in this area, e.g., (Bradshaw, 2003; Dunlop and van Rijsbergen, 1993). However, previous experiments and test collections have had only limited access to the content of the citing and/or cited documents: Bradshaw (2003) found index terms in Citeseer citation contexts rather than full texts; Dunlop and van Rijsbergen (1993) experimented on the CACM collection of abstracts; and the GIRT collection (Kluck, 2003), likewise, consists of content-bearing fields, not full documents.
The original TREC Genomics collection consists of MEDLINE records, containing abstracts but not full papers (Hersh and Bhupatiraju, 2003). (In the 2006 track, a new collection of full-text documents was introduced, but this was not available when our work began (Hersh et al., 2006); its suitability as a test collection for citation-related work, e.g., the proportion of internal citations, has not yet been established.) Our test collection must contain full text for many citing and cited documents. It should, thus, help to address the research question of how to use citations between documents for IR.
To turn a document collection into a test collection, a parallel set of search queries and relevance judgements is needed. There are a number of alternative methods for building a test collection. For TREC, humans devise queries speciﬁcally for a given set of documents and make relevance judgements on pooled retrieved documents from that set (Harman, 2005). This is too labour-intensive for our project, particularly as we use scientiﬁc papers as data, where deciding on relevance would take even more time than for newspaper text. We, instead, adapted the methodology from the Cranﬁeld 2 tests (Cleverdon et al., 1966), which is speciﬁc to scientiﬁc texts.
The Cranﬁeld test collection was built by asking authors to formulate the research question(s) behind their work and to judge how relevant each reference in their paper was to each of their research questions. From a base collection of 182 (high speed aerodynamics and aircraft structures) papers, referenced documents were obtained and added. The collection was further expanded in a second stage, using bibliographic coupling to search for similar papers to the referenced ones and employing humans to search the collection for other relevant papers. The resultant collection comprised 1400 documents and 221 queries (Cleverdon, 1997).
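Bibliographic coupling, used in that second Cranfield stage, simply treats two papers as related when their reference lists overlap. A minimal sketch, with invented reference lists:

```python
def coupling_strength(refs_a, refs_b):
    """Bibliographic coupling strength: number of references shared by two papers."""
    return len(set(refs_a) & set(refs_b))

# Hypothetical reference lists for two papers
paper_a = ["cran-042", "cran-101", "cran-177"]
paper_b = ["cran-101", "cran-177", "cran-250"]
print(coupling_strength(paper_a, paper_b))  # 2 shared references
```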
The principles behind the Cranﬁeld technique are:
• Queries: Each paper has one or more underlying research questions; these constitute valid search queries.
• Relevant documents: A paper’s reference list is a good starting point for ﬁnding papers relevant to its research questions.
• Judges: The paper author is the person best qualiﬁed to judge relevance.
The source-document principle (i.e., using queries created from documents in the collection) attracted criticism: the fact that the queries were formulated after the cited papers had been read may have inﬂuenced the wording of the queries and, thus, led to a bias towards one particular indexing language (Vickery, 1967). While this may be true, it is far more a problem for Cranﬁeld 2 (which investigated indexing devices per se) than for us, as the indexing language will be kept constant in our experiments. For our purposes, we assume that the source-document principle is sound.
We adapted the Cranﬁeld method to ﬁt a ﬁxed, existing document collection. We designed our methodology around an upcoming (ACL Anthology) conference and approached the paper authors at around the time of the conference, to maximize their willingness to participate and to minimise possible changes in their perception of relevance since they wrote the paper. Hence, the authors of accepted papers for ACL-2005 and HLT-EMNLP-2005 were asked, by email, for their research questions and relevance judgements for their references. Personalised materials for participation were sent, including a reproduction of their paper’s reference list in their response form. This meant that invitations could only be sent once the paper had been made available online.
This resulted in a test collection of 196 queries; however, we commented that the low number of judged relevant documents is potentially problematic (Ritchie et al., 2006). In line with Cranfield, therefore, we expanded our test collection to add judgements for non-cited papers. In §2, we present our methodology for this expansion, which we call Phase Two. We briefly survey the relevance data accumulated via our methods. In §3, we describe using our test collection with standard IR tools, comparing results before and after the judgement set is expanded. §4 concludes and outlines future work.

Table 1: Classes of Query Reformulation

Typo: Corrected spelling or typographical error in the research question, as returned by the author.
  Example: Handling biograpical questions with implicature in a question answering system. → Handling biographical questions with implicature in a question answering system.

Filler: Removed part(s) of the research question that did not contribute to its meaning, e.g., contentless ‘filler’ phrases or repetitions of existing content.
  Example: We present a novel mechanism for improving reference resolution by using the output of a relation tagger to rescore coreference hypotheses. → improving reference resolution by using the output of a relation tagger to rescore coreference hypotheses.

Anaphor: Resolved anaphoric references in the research question to ideas introduced in earlier questions from the same author.
  Example: How can the best alignment according to the model be found? → How can the best word-alignment according to the weighted linear model be found?

Context: Added terms from earlier research questions to provide apparently missing context.
  Example: Identifying an appropriate domain → Identifying an appropriate domain - natural language generation
2 Expanding Our Test Collection

Whereas the Cranfield expansion also involved adding more documents to the collection, the purpose of our Phase Two was solely to obtain more relevance judgements for the queries from Phase One. Our methodology was as follows.
First, we inspected the research questions returned in Phase One and noted that some were unsuitable as search queries. Mostly, these were artefacts of the method by which the queries were created: we did not explicitly ask the authors for independent search queries. Thus, where an author had returned multiple research questions, the later questions sometimes contained anaphoric references to earlier ones or did not include terms describing the background context of the research (that had been introduced in an earlier question). In addition, some questions contained spelling or typographical errors and some were formulated elaborately or verbosely, with many terms that did not contribute to the underlying meaning, e.g., contentless ‘filler’ phrases or repetitions of existing content. While a good retrieval system should be robust to query imperfections, this is outside the scope of our research. Therefore, we minimally reformulated 34 of the 196 research questions, to turn them into error-free, standalone queries, while keeping them as close to the author’s original research question as possible. Authors were asked to approve our reformulations (i.e., confirm that the reformulated query corresponded to their intentions) or to correct the query, for resubmission to the pooling process. Table 1 describes the four classes of query reformulation. We note that some of the Cranfield queries were similarly reformulated (Cleverdon et al., 1966).
For each query, we next constructed a list of potentially relevant documents in the Anthology. We first ‘manually’ searched the entire Anthology using the Google Search facility on the Anthology website. We started with the author’s complete research question (or our reformulation) as the search query, then used successive query refinements or alternatives. These query changes were made depending on the relevance of the search results, i.e., relevance according to our intuitions about the query meaning and guided, where necessary, by the author’s Phase One judgements. Our searches were not strictly manual in the sense of the Cranfield manual searches: we used an automated search tool rather than searching through papers by hand. We use the term ‘manual’ to indicate the significant human involvement in the searches.
We then ran the queries through three ‘standard’ IR models, implemented in the Lemur toolkit, with standard parameters:
1. Okapi BM25 with relevance feedback
2. KL-divergence LM with relevance feedback and document model smoothing
3. Cosine similarity

We pooled the manual and automatic search results, including all manual search results and adding one from each of the automatic retrieved lists (removing duplicates) to make a list of fifteen documents. If there were fifteen or more manual search results, only manual results (and all of these) were included, as these were felt to be more ‘trustworthy’, having already been judged as likely to be relevant. Some lists were, thus, longer than fifteen documents.
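A simplified reconstruction of this pooling step is sketched below; the round-robin order over the three automatic runs and the variable names are our assumptions, not details taken from the paper.

```python
def build_pool(manual_results, automatic_runs, target_size=15):
    """Pool manual and automatic search results for a single query.

    manual_results: ranked doc ids from the manual Anthology search (always all kept)
    automatic_runs: ranked doc id lists, one per IR model (BM25, KL-divergence LM, cosine)
    Automatic results are added one at a time across the runs, skipping duplicates,
    until the pool reaches target_size; pools with >= 15 manual results stay manual-only.
    """
    pool = list(dict.fromkeys(manual_results))  # preserve order, drop duplicates
    cursors = [0] * len(automatic_runs)
    while len(pool) < target_size:
        added = False
        for i, run in enumerate(automatic_runs):
            while cursors[i] < len(run) and run[cursors[i]] in pool:
                cursors[i] += 1                 # skip documents already in the pool
            if cursors[i] < len(run):
                pool.append(run[cursors[i]])
                cursors[i] += 1
                added = True
            if len(pool) >= target_size:
                break
        if not added:                           # every run is exhausted
            break
    return pool

# Toy usage with invented document ids
print(build_pool(["d1", "d2"], [["d2", "d3"], ["d4"], ["d1", "d5"]], target_size=5))
```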
The list of potentially relevant documents was then included in personalised materials and sent to the query author for judgement. The materials included instructions and a response form in both plaintext and PDF, including the URL for a webpage with identifying details about the papers for relevance judgement (i.e., title and authors) and links to the papers in PDF, to aid the relevance decision.
We decided to ask for binary relevance judgements for this second round. Firstly, the relevance scale used in Phase One was designed for the specific task of grading the relevance of referenced papers in relation to the research question underlying the source paper; the grades were described in terms of how important the information in that reference would be to someone reading the paper. Judging the relevance of papers from outside the reference list is a slightly different task, therefore, and would have required a translation of the relevance scale. It was not clear that an exactly equivalent set of grades could have been formulated, such that a Phase One grade 4 was equivalent to a Phase Two grade 4, etc. Furthermore, it was already unclear whether we would be able to make use of the graded relevance judgements from Phase One, since most of the standard evaluation measures use binary relevance, without the added complication of having a new set of graded judgements that were not straightforwardly interchangeable. Graded judgements have, moreover, been collapsed in previous studies and shown to give stable evaluation results (Voorhees, 1998).
Additionally, in our case, the binary and graded judgements are made by the same person so we might conjecture that their judgement thresholds are more consistent. Therefore, we changed to binary judgements, in the hope that this would also make the task easier for the authors and encourage a higher response rate.
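As an illustration of the kind of binary evaluation this enables, here is a minimal sketch of collapsing graded judgements to a binary relevant set and scoring one ranked list with average precision; the threshold and the toy grades are our assumptions, not the paper's actual mapping.

```python
def collapse_to_binary(graded, threshold):
    """Map graded judgements {doc_id: grade} to a set of binary-relevant doc ids."""
    return {doc for doc, grade in graded.items() if grade >= threshold}

def average_precision(ranking, relevant):
    """Standard binary average precision for a single ranked list."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

graded = {"d1": 4, "d2": 0, "d3": 2}                     # hypothetical Phase One grades
relevant = collapse_to_binary(graded, threshold=2)        # assumed cut-off
print(average_precision(["d3", "d2", "d1"], relevant))    # 0.833...
```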
2.1 Returns and Analysis
Around 500 invitations were sent in Phase One. 85 completed response forms were returned, giving 235 queries with relevance judgements. We discarded queries from co-authors whose ﬁrst author had also responded and queries with no relevant Anthology-internal references, leaving 196 queries, henceforth the All Phase One set.
74 invitations were sent in Phase Two and 44 forms were returned, giving judgements for 82 queries. 22 of these queries had been reformulated, and all but two of the reformulations were approved by the author. In both cases, the author submitted an alternative reformulation for pooling and a new list (including the previous manual search results) was sent back for judgement. Both authors judged the (non-duplicate) documents in the new list.
Table 2 compares our test collection, before and after Phase Two, to some other test collections.