
Creating a Test Collection:

Relevance Judgements of Cited & Non-cited Papers

Anna Ritchie
University of Cambridge, Computer Laboratory
15 J J Thomson Avenue, Cambridge, CB3 0FD, U.K.
ar283@cl.cam.ac.uk

Stephen Robertson
Microsoft Research Ltd, Roger Needham House
7 J J Thomson Avenue, Cambridge, CB3 0FB, U.K.
ser@microsoft.com

Simone Teufel
University of Cambridge, Computer Laboratory
15 J J Thomson Avenue, Cambridge, CB3 0FD, U.K.
sht25@cl.cam.ac.uk

Abstract

We investigate the effect of different sources of relevant documents in the creation of a test collection in the scientific domain.

Based on the Cranfield 2 design, paper authors are asked to judge their cited papers for relevance in the first stage. In a second stage, documents outside the reference list are judged. In this paper, we use the test collection with standard IR engines to compare the information contained in the judgements of the first vs second stage. Using different correlation studies, we found that the judgements of the cited papers do not predict those from the non-cited papers, which means that the combination of sources results in a higher quality collection.

1 Introduction

Building a test collection is a long and expensive process but is sometimes necessary when no ready-made collection with the right properties exists. We aim to improve term-based IR on scientific papers with citation information, by using terms from the citing document to additionally describe (i.e., index) the cited document. We needed a test collection with full text for many citing and cited documents. A high proportion of citations from documents in the collection to other collection documents will be most useful; we built our test collection around the ACL Anthology 1, since we empirically found Computational Linguistics to be a relatively self-contained field.
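As a rough illustration of this indexing idea, the sketch below adds terms from citing documents to the index terms of the document they cite. The document identifiers, the whitespace tokeniser and the `citation_contexts` structure are assumptions made for the example; they are not taken from the collection or from any particular IR toolkit.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokeniser (a stand-in for real preprocessing)."""
    return text.lower().split()


def build_index_terms(documents, citation_contexts):
    """Return term counts per document, extended with citing-document terms.

    documents: dict mapping document id -> full text of that document
    citation_contexts: iterable of (citing_id, cited_id, context_text)
    """
    index_terms = defaultdict(Counter)
    for doc_id, text in documents.items():
        index_terms[doc_id].update(tokenize(text))
    for _citing_id, cited_id, context in citation_contexts:
        # Only documents inside the collection can receive extra terms,
        # which is why a high proportion of internal citations is useful.
        if cited_id in documents:
            index_terms[cited_id].update(tokenize(context))
    return index_terms


# Hypothetical two-document collection with one internal citation.
documents = {
    "P1": "statistical parsing with lexicalised grammars",
    "P2": "we evaluate a parser on newspaper text",
}
citation_contexts = [("P2", "P1", "a widely used treebank parser")]
index_terms = build_index_terms(documents, citation_contexts)
```

In this toy case, document P1 never uses the word "parser" itself but becomes retrievable by it, because the citing document describes it that way.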

The idea of using terms external to a document for indexing, coming from a ‘citing’ document, is also used in web IR. Citations are quite like hyperlinks, and link structure, particularly anchor text, has been used to advantage in retrieval tasks (McBryan, 1994; Hawking and Craswell, 2005).

While web pages are often poorly self-descriptive (Brin and Page, 1998), anchor text is often a higher-level description of the pointed-to page (Davison, 2000). Some work has been done in this area, e.g., (Bradshaw, 2003; Dunlop and van Rijsbergen, 1993). However, previous experiments and test collections have had only limited access to the content of the citing and/or cited documents: Bradshaw (2003) found index terms in Citeseer citation contexts rather than full texts; Dunlop and van Rijsbergen (1993) experimented on the CACM collection of abstracts; and the GIRT collection (Kluck, 2003), likewise, consists of content-bearing fields, not full documents.

The original TREC Genomics collection 2 consists of MEDLINE records, containing abstracts but not full papers (Hersh and Bhupatiraju, 2003). Our test collection must contain full text for many citing and cited documents. It should, thus, help to address the research question of how to use citations between documents for IR.

1 http://www.aclweb.org/anthology/

2 In the 2006 track, a new collection of full-text documents was introduced but this was not available when our work began (Hersh et al., 2006). Its suitability as a test collection for citation-related work, e.g., the proportion of internal citations, has not yet been established.

To turn a document collection into a test collection, a parallel set of search queries and relevance judgements is needed. There are a number of alternative methods for building a test collection. For TREC, humans devise queries specifically for a given set of documents and make relevance judgements on pooled retrieved documents from that set (Harman, 2005). This is too labour-intensive for our project, particularly as we use scientific papers as data, where deciding on relevance would take even more time than for newspaper text. We, instead, adapted the methodology from the Cranfield 2 tests (Cleverdon et al., 1966), which is specific to scientific texts.

The Cranfield test collection was built by asking authors to formulate the research question(s) behind their work and to judge how relevant each reference in their paper was to each of their research questions. From a base collection of 182 (high speed aerodynamics and aircraft structures) papers, referenced documents were obtained and added. The collection was further expanded in a second stage, using bibliographic coupling to search for similar papers to the referenced ones and employing humans to search the collection for other relevant papers. The resultant collection comprised 1400 documents and 221 queries (Cleverdon, 1997).
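Bibliographic coupling, in its usual sense, treats two papers as similar when their reference lists overlap. The following minimal sketch ranks papers by the number of references they share with a target paper. The exact procedure used at Cranfield is not described here, so the scoring, the tie-breaking and the toy reference lists are assumptions for illustration only.

```python
def coupling_strength(refs_a, refs_b):
    """Number of references that two papers have in common."""
    return len(set(refs_a) & set(refs_b))


def most_coupled(target_id, reference_lists, top_n=3):
    """Return ids of the papers sharing the most references with the target.

    reference_lists: dict mapping paper id -> list of cited reference ids
    Papers with no shared reference are left out; ties are broken by id.
    """
    target_refs = reference_lists[target_id]
    scored = []
    for doc_id, refs in reference_lists.items():
        if doc_id == target_id:
            continue
        strength = coupling_strength(target_refs, refs)
        if strength > 0:
            scored.append((strength, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _strength, doc_id in scored[:top_n]]


# Hypothetical reference lists.
reference_lists = {
    "A": ["r1", "r2", "r3"],
    "B": ["r2", "r3", "r4"],
    "C": ["r3", "r9"],
    "D": ["r7"],
}
similar_to_a = most_coupled("A", reference_lists)
```

Here B shares two references with A and C shares one, so they are returned in that order, while D shares none and is excluded.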

The principles behind the Cranfield technique are:

• Queries: Each paper has an underlying research question(s); these constitute valid search queries.

• Relevant documents: A paper’s reference list is a good starting point for finding papers relevant to its research questions.

• Judges: The paper author is the person best qualified to judge relevance.

The source-document principle (i.e., using queries created from documents in the collection) attracted criticism: the fact that the queries were formulated after the cited papers had been read may have influenced the wording of the queries and, thus, led to a bias towards one particular indexing language (Vickery, 1967). While this may be true, it is far more a problem for Cranfield 2 (which investigated indexing devices per se) than for us, as the indexing language will be kept constant in our experiments. For our purposes, we assume that the source-document principle is sound.





We adapted the Cranfield method to fit a fixed, existing document collection. We designed our methodology around an upcoming (ACL Anthology) conference and approached the paper authors at around the time of the conference, to maximise their willingness to participate and to minimise possible changes in their perception of relevance since they wrote the paper. Hence, the authors of accepted papers for ACL-2005 and HLT-EMNLP-2005 were asked, by email, for their research questions and relevance judgements for their references. Personalised materials for participation were sent, including a reproduction of their paper’s reference list in their response form. This meant that invitations could only be sent once the paper had been made available online.

This resulted in a test collection of 196 queries; however, we commented that the low number of judged relevant documents is potentially problematic (Ritchie et al., 2006). In line with Cranfield, therefore, we expanded our test collection to add judgements for non-cited papers. In §2, we present our methodology for this expansion, which we call Phase Two. We briefly survey the relevance data accumulated via our methods. In §3, we describe using our test collection with standard IR tools, comparing results before and after the judgement set is expanded. §4 concludes and outlines future work.

Class: Typo
Description: Corrected spelling or typographical error in the research question, as returned by the author.
Example: Handling biograpical questions with implicature in a question answering system. → Handling biographical questions with implicature in a question answering system.

Class: Filler
Description: Removed part(s) of the research question that did not contribute to its meaning, e.g., contentless ‘filler’ phrases or repetitions of existing content.
Example: We present a novel mechanism for improving reference resolution by using the output of a relation tagger to rescore coreference hypotheses. → improving reference resolution by using the output of a relation tagger to rescore coreference hypotheses.

Class: Anaphor
Description: Resolved anaphoric references in the research question to ideas introduced in earlier questions from the same author.
Example: How can the best alignment according to the model be found? → How can the best word-alignment according to the weighted linear model be found?

Class: Context
Description: Added terms from earlier research questions to provide apparently missing context.
Example: Identifying an appropriate domain → Identifying an appropriate domain - natural language generation

Table 1: Classes of Query Reformulation

2 Expanding Our Test Collection

Whereas the Cranfield expansion also involved adding more documents to the collection, the purpose of our Phase Two was solely to obtain more relevance judgements for the queries from Phase One. Our methodology was as follows.

First, we inspected the research questions returned in Phase One and noted that some were unsuitable as search queries. Mostly, these were artefacts of the method by which the queries were created: we did not explicitly ask the authors for independent search queries. Thus, where an author had returned multiple research questions, the later questions sometimes contained anaphoric references to earlier ones or did not include terms describing the background context of the research (that had been introduced in an earlier question). In addition, some questions contained spelling or typographical errors and some were formulated elaborately or verbosely, with many terms that did not contribute to the underlying meaning, e.g., contentless ‘filler’ phrases or repetitions of existing content. While a good retrieval system should be robust to query imperfections, this is outside the domain of our research. Therefore, we minimally reformulated 34 of the 196 research questions, to turn them into error-free, standalone queries, while keeping them as close to the author’s original research question as possible. Authors were asked to approve our reformulations (i.e., confirm that the reformulated query corresponded to their intentions) or to correct the query, for resubmission to the pooling process. Table 1 describes the four classes of query reformulation. We note that some number of the Cranfield queries were similarly reformulated (Cleverdon et al., 1966).

For each query, we next constructed a list of potentially relevant documents in the Anthology. We first ‘manually’ searched the entire Anthology using the Google Search facility on the Anthology website. We started with the author’s complete research question (or our reformulation) as the search query, then used successive query refinements or alternatives. These query changes were made depending on the relevance of search results, i.e., relevance according to our intuitions about the query meaning and guided, where necessary, by the author’s Phase One judgements. Our manual searches were not strictly manual in the same sense as the Cranfield manual searches: we used an automated search tool rather than searching through papers by hand. We use the term ‘manual’ to indicate the significant human involvement in the searches.

We then ran the queries through three ‘standard’ IR models, implemented in Lemur 3, with standard parameters:

1. Okapi BM25 with relevance feedback

2. KL-divergence LM with relevance feedback and document model smoothing

3. Cosine similarity

We pooled the manual and automatic search results, including all manual search results and adding one from each of the automatic retrieved lists (removing duplicates) to make a list of fifteen documents. If there were fifteen or more manual search results, only manual results (and all of these) were included, as these were felt to be more ‘trustworthy’, having already been judged as likely to be relevant. Some lists were, thus, longer than fifteen documents.
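The pooling rule can be sketched as follows. The round-robin reading of "adding one from each of the automatic retrieved lists" (take the next unseen document from each list in turn until the target size is reached) is our assumption, as are the function name and the toy document ids; the text does not specify the order in which the automatic lists were visited.

```python
def pool_results(manual, automatic_lists, target_size=15):
    """Build the list of documents sent to an author for judgement.

    manual: manual search results, all of which are always kept
    automatic_lists: ranked result lists, one per automatic IR model
    """
    pool = list(dict.fromkeys(manual))  # keep order, drop duplicates
    if len(pool) >= target_size:
        # Enough manual results: use only these, even if more than the target.
        return pool
    seen = set(pool)
    positions = [0] * len(automatic_lists)  # next unread rank in each list
    progressed = True
    while len(pool) < target_size and progressed:
        progressed = False
        for i, ranked in enumerate(automatic_lists):
            # Skip documents that are already in the pool.
            while positions[i] < len(ranked) and ranked[positions[i]] in seen:
                positions[i] += 1
            if positions[i] < len(ranked):
                doc = ranked[positions[i]]
                pool.append(doc)
                seen.add(doc)
                positions[i] += 1
                progressed = True
                if len(pool) == target_size:
                    break
    return pool


# Hypothetical results for one query, with a small target size for brevity.
manual = ["m1", "m2"]
automatic_lists = [["a1", "m1", "a2"], ["a1", "b1"], ["c1"]]
pooled = pool_results(manual, automatic_lists, target_size=5)
```

With these inputs the pool is the two manual results followed by one new document from each automatic list: m1, m2, a1, b1, c1.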

The list of potentially relevant documents was then included in personalised materials and sent to the query author for judgement. The materials included instructions and a response form in both plaintext and PDF, including the URL for a webpage with identifying details about the papers for relevance judgement (i.e., title and authors) and links to the papers in PDF, to aid the relevance decision.

We decided to ask for binary relevance judgements for this second round. Firstly, the relevance scale used in Phase One was designed for the specific task of grading the relevance of referenced papers in relation to the research question underlying the source paper; the grades were described in terms of how important the information in that reference would be to someone reading the paper. Judging the relevance of papers from outside the reference list is a slightly different task, therefore, and would have required a translation of the relevance scale. It was not clear that an exactly equivalent set of grades could have been formulated, such that a Phase One grade 4 was equivalent to a Phase Two grade 4, etc. Furthermore, it was already unclear whether we would be able to make use of the graded relevance judgements from Phase One, since most of the standard evaluation measures use binary relevance, without the added complication of having a new set of graded judgements that weren’t straightforwardly interchangeable.

[...] been collapsed in previous studies and shown to give stable evaluation results (Voorhees, 1998).

Additionally, in our case, the binary and graded judgements are made by the same person so we might conjecture that their judgement thresholds are more consistent. Therefore, we changed to binary judgements, in the hope that this would also make the task easier for the authors and encourage a higher response rate.
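Collapsing graded judgements into binary ones amounts to choosing a cut-off grade. The sketch below shows the operation only: the threshold value, the assumption that higher grades mean greater relevance, and the example grades are all hypothetical, since the cut-off is not stated here.

```python
def to_binary(graded, min_relevant_grade):
    """Map graded judgements to True/False using a cut-off grade.

    graded: dict mapping document id -> integer relevance grade
    min_relevant_grade: lowest grade counted as relevant (an assumption)
    """
    return {
        doc_id: grade >= min_relevant_grade
        for doc_id, grade in graded.items()
    }


# Hypothetical Phase One grades for one query.
graded = {"docA": 4, "docB": 2, "docC": 1}
binary = to_binary(graded, min_relevant_grade=2)
```

Moving the cut-off changes which documents count as relevant, which is why the stability of evaluation results under collapsing matters.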

2.1 Returns and Analysis

Around 500 invitations were sent in Phase One. 85 completed response forms were returned, giving 235 queries with relevance judgements. We discarded queries from co-authors whose first author had also responded and queries with no relevant Anthology-internal references, leaving 196 queries, henceforth the All Phase One set.

74 invitations were sent in Phase Two and 44 forms were returned, covering 82 queries 4. Of these queries, 22 had been reformulated, and all but two of the reformulations were approved by the author. In both of the remaining cases, the author submitted an alternative reformulation for pooling and a new list (including the previous manual search results) was sent back for judgement. Both authors judged the (non-duplicate) documents in the new list.
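The return figures above translate into simple rates, computed below from the counts reported in this section. The Phase One rate is approximate, since the number of invitations is given only as "around 500"; the helper name `rate` is ours.

```python
def rate(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


# Counts as reported in Section 2.1.
phase_one_response = rate(85, 500)   # forms returned / invitations (approx.)
phase_two_response = rate(44, 74)    # forms returned / invitations
queries_kept = rate(196, 235)        # queries retained after discarding
```

This gives a response rate of roughly 17% in Phase One against about 59.5% in Phase Two, with about 83.4% of the Phase One queries retained in the All Phase One set.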

Table 2 compares our test collection, before and after Phase Two, to some other test collections.


