FREE ELECTRONIC LIBRARY - Online materials, documents

Pages:   || 2 | 3 | 4 | 5 |   ...   | 12 |

«Foundations and Trends R in Databases Vol. 1, No. 4 (2007) 379–474 c 2009 J. Cheney, L. Chiticariu and W.-C. Tan DOI: 10.1561/1900000006 Provenance ...»

-- [ Page 1 ] --

Foundations and Trends R in


Vol. 1, No. 4 (2007) 379–474

c 2009 J. Cheney, L. Chiticariu and W.-C. Tan

DOI: 10.1561/1900000006

Provenance in Databases: Why, How, and Where

By James Cheney, Laura Chiticariu

and Wang-Chiew Tan


1 Introduction 380

1.1 Why, How and Where: An Overview 382

1.2 Approaches in Computing Provenance: Eager vs Lazy 392

1.3 Notational Preliminaries 394 2 Why-Provenance 397

2.1 Lineage 397

2.2 Why-Provenance 407 3 How-Provenance 416

3.1 Provenance Semirings 417

3.2 Trio Lineage 421

3.3 Provenance Semirings and Recursion 427

3.4 How-Provenance for Schema Mappings 430 4 Where-Provenance 441

4.1 Where-Provenance 441

4.2 Applications 450 5 Comparing Models of Provenance 453

5.1 Relating Semiring-Based Techniques 454

5.2 Relating Lineage, Why-Provenance and Minimal Why-Provenance 458

5.3 Where-Provenance is “Contained” in Why-Provenance 459

5.4 Is Where-Provenance an Instance of How-Provenance? 460 6 Conclusions 463 Acknowledgments 469 References 470 Foundations and Trends R in Databases Vol. 1, No. 4 (2007) 379–474 c 2009 J. Cheney, L. Chiticariu and W.-C. Tan DOI: 10.1561/1900000006

Provenance in Databases:

Why, How, and Where James Cheney1, Laura Chiticariu2 and Wang-Chiew Tan3 University of Edinburgh, UK, jcheney@inf.ed.ac.uk IBM Almaden Research Center, San Jose, CA, USA, chiti@almaden.ibm.com University of California, Santa Cruz, CA, USA, wctan@cs.ucsc.edu Abstract Different notions of provenance for database queries have been proposed and studied in the past few years. In this article, we detail three main notions of database provenance, some of their applications, and compare and contrast amongst them. Specifically, we review why, how, and where provenance, describe the relationships among these notions of provenance, and describe some of their applications in confidence computation, view maintenance and update, debugging, and annotation propagation.

Introduction Provenance information describes the origins and the history of data in its life cycle. Such information (also called lineage) is important to many data management tasks. Historically, databases and other electronic information sources were trusted because they were under centralized control: it was assumed that trustworthy and knowledgeable people were responsible for the integrity of data in databases or repositories. As argued by Lynch [49], this assumption is no longer valid for online data. Today, data is often made available on the Internet with no centralized control over its integrity: data is constantly being created, copied, moved around, and combined indiscriminately. Because information sources (or different parts of a single large source) may vary widely in terms of quality, it is essential to provide provenance and other context information which can help end users judge whether query results are trustworthy.

Data warehouses [17] and curated databases [10] are typical examples where provenance information is essential. In both data warehouses and curated databases, tremendous (and often manual) effort is usually expended in the construction of the resulting database — in the former, in specifying the extract-transform-load (ETL) process and in the latter, in incrementally adding and updating the database. In a sense, provenance adds value to the data by explaining how it was obtained.

Hence, it is of utmost importance to understand the provenance of data in the resulting database, in order to check the correctness of an ETL specification or assess the quality and trustworthiness of curated data.

Provenance has been studied in several different areas of data management, such as scientific data processing [8, 29, 53] and database management systems [15, 57]. We focus on provenance for data residing in a database management system. A number of notions of provenance in databases have been proposed in the literature. The most common forms of database provenance describe relationships between data in the source and in the output, for example, by explaining where output data came from in the input [58, 13], showing inputs that explain why an output record was produced [27, 13] or describing in detail how an output record was produced [43]. Besides being interesting in their own right for understanding the behavior of queries, these forms of provenance have been used in the study of classical database problems, such as view update [14] and the expressiveness of update languages [11].

More recently, they have also been used in the study of annotation propagation [7, 11, 58] and updates across peer-to-peer systems [42].

In this article, we focus on these three existing notions of why-, howand where-provenance in databases. We shall describe them, discuss their applications, and compare and contrast these different notions in the subsequent sections. In the rest of this introductory section, we provide a high-level overview of these different notions of provenance, and introduce notation that will be used throughout the rest of the article.

Sections 2, 3 and 4 focus on why-, how- and where-provenance, respectively, including formal details and applications. Section 5 discusses the relationships among the approaches, including proofs or disproofs of some “folklore” properties which have been stated in the literature but not (to our knowledge) carefully formalized and proved. Finally, Section 6 concludes with a brief discussion of additional related work and research challenges.

We emphasize that there are numerous other notions of provenance that are not described in this article. For example, provenance is also an active topic of research in scientific workflow management system 382 Introduction community and in the file and storage systems community. This article focuses on provenance within databases, and we refer the interested reader to the surveys [8, 53], and a recent tutorial [29] for a discussion on provenance research in general, as well as in the workflow community.

Recent workshops [33, 35] also provide insight into the different views of provenance by diverse research communities.

Even in database settings, there is work that does not fit neatly into the why-, where- and how-provenance framework we focus on here, including early work such as Wang and Madnick’s Polygen model [58] and Woodruff and Stonebraker’s work on lineage [61], as well as Cui et al.’s lineage model [27] and more recent work on the Trio system [6].

We have chosen to focus on the why-, where-, and how-provenance framework because there is now enough related research on these models area to justify a critical review and comparison. We have recast lineage and a simplification of the Trio model as instances of our framework, but the Polygen, Woodruff–Stonebraker, and full Trio models seem to resist this categorization. Our classification should therefore be viewed as a preliminary attempt towards a full understanding of provenance in databases. We return to this issue in Section 6.

1.1 Why, How and Where: An Overview 1.1.1 Why-Provenance Cui et al. [27] were among the first to formalize a notion of provenance, of data in the context of relational databases, called lineage. They associated each tuple t present in the output of a query with a set of tuples present in the input, called the lineage of t. Intuitively, the lineage of t is meant to collect all of the input data that “contributed to” t or helped to “produce” t. To illustrate, we use a simple example database of an online travel portal shown in Figure 1.1, where the labels t1,..., t8 are used to identify the tuples. Consider the query Q1 1 shown below, which asks for all travel agencies that offer external boat tours and their corresponding phone numbers by joining Agencies with ExternalTours

–  –  –

The result of Q1 executed on our example database in Figure 1.1 is shown above on the right. According to Cui et al., the lineage of the output tuple (HarborCruz, 831-3000) is {Agencies(t2 ), ExternalTours(t7 )}, where Agencies(t2 ) and ExternalTours(t7 ) denote the subinstances of Agencies and ExternalTours consisting of tuples t2 and t7, respectively.

Intuitively, the two source tuples witness the existence of the tuple of interest, (HarborCruz, 831-3000), according to Q1. Furthermore, each of the two source tuples justify the existence of the HarborCruz tuple.

In other words, the source tuples t2 and t7 form a “proof” or “witness” for the HarborCruz output tuple according to Q1, and no other source tuples are part of the witness since they do not contribute to the HarborCruz output tuple. Technically speaking, by “witness” we mean a subset of the input database records that is sufficient to ensure that a given output tuple appears in the result of a query.

As another example, the lineage of the output tuple (BayTours, 415-1200) is the union of the lineage of the intermediate 384 Introduction tuples — (BayTours, San Francisco, 415-2000, Santa Cruz, boat, $100) and (BayTours, San Francisco, 415-2000, Monterey, boat, $250) — before the projection operator is applied on name and phone. The union of the lineage of these two intermediate tuples gives {Agencies(t1 ), ExternalTours(t5,t6 )}. Observe that this lineage representation is not as precise as one may like as it does not specify that t5 and t6 do not need to coexist together in order to witness the BayTours output tuple.

Indeed, {t1, t5 } and respectively, {t1, t6 } are two different witnesses for the BayTours tuple. This illustrates that not every tuple in the lineage is “necessary” for the output (BayTours, 415-1200) to be produced.

This intuition was formalized by Buneman et al. [13] who introduced the notion of why-provenance that captures the different witnesses. Their work is in the context of a semi-structured data model with a query language that is appropriate for that data model, but we shall restrict our discussion to the relational model with select–project– join queries here.

Like lineage, why-provenance is based on the idea of providing information about the witnesses to a query. Recall that a witness is a subset of the database records that is sufficient to ensure that a given record is in the output. There may be a large number of such witnesses because many records are “irrelevant” to the presence of an output record of interest. In fact, the number of witnesses can easily be exponential in the size of the input database. To avoid this problem, the definition of why-provenance restricts attention to a smaller number of witnesses.

According to [13], the why-provenance of an output tuple t in the result of a query Q applied to a database D is defined as the witness basis of t according to Q. The witness basis is a particular set of witnesses which can be calculated efficiently from Q and D. This is generally much smaller than the full witness set. However, every witness contains an element of the witness basis, so the witness basis can be viewed as a compact representation of the set of all witnesses.

Going back to our example, the why-provenance of (BayTours, 415in the result of Q1 is the set {{t1, t5 }, {t1, t6 }}. There are two witnesses, corresponding to {t1, t5 } and {t1, t6 }, respectively. Intuitively, this tells us that the output tuple is witnessed by source tuples in two different ways according to Q1 : the first uses the tuples t1 and

1.1 Why, How and Where: An Overview

–  –  –

Fig. 1.3 Example showing that why-provenance is sensitive to query rewriting.

t5, while the second uses the tuples t1 and t6. Observe that {t1, t5, t6 } is not a minimal witness, since the query Q1 requires witnesses to consist of exactly one tuple from Agencies, and one tuple from ExternalTours according to the FROM clause of Q1.

The preceding discussion suggests that the witness basis may be tied to the structure of the query and it is therefore sensitive to how a query is formulated. To illustrate, consider the instance I and two equivalent queries Q and Q shown in Figure 1.2. For conciseness, we use the Datalog conjunctive query notation to express Q and Q here and throughout the paper as convenient. Consider the output tuple (1, 2) in the result of Q (and Q ) applied to I shown in Figure 1.3.

The witness basis of this output tuple is {{t}}, according to Q and I.

However, even though Q is equivalent to Q, the witness basis of the output tuple (1, 2) according to Q and I is {{t}, {t, t }}.

Although equivalent queries may have different witness bases, Buneman et al. [13] showed that a subset of the witness basis, called the minimal witness basis, is invariant under equivalent queries. The minimal witness basis consists of all the minimal witnesses in the witness basis, where a witness is minimal if none of its proper subinstances is also a witness in the witness basis. For example, {t} is a minimal witness for the output tuple (1, 2) in Figure 1.2. However, {t, t } is not a 386 Introduction minimal witness since {t} is a subinstance of it and it is a witness to (1,2). Hence, the minimal witness basis is {{t}} for this example. In a subsequent work by [14], minimal witnesses were used in the study of variants of the view deletion problem, which is that of finding source tuples to remove in order to delete a tuple from the view for selectproject–join–union queries.

1.1.2 How-Provenance Why-provenance describes the source tuples that witness the existence of an output tuple in the result of the query. However, it leaves out some information about how an output tuple is derived according to the query. To illustrate, consider the query Q2 of Figure 1.4 which asks for all cities where tours are offered (assuming all agencies offer tours in the city they are headquartered). The result of Q2 on the example database in Figure 1.1 is shown in the right of Figure 1.4. (Ignore the additional tags on the output tuples for now.) For the output tuple (San Francisco, 415-1200) in the result of Q2, its why-provenance is {{t1 }, {t1,t3 }}. This description tells us that t1 alone, and t1 with t3 are each sufficient to witness the existence of the output tuple according to Q2. However, it does not tell us about the structure of the proof that t1 (as well as t1 and t3 ) help witness the output tuple according to Q2.

Pages:   || 2 | 3 | 4 | 5 |   ...   | 12 |

Similar works:

«UNITED STATES DISTRICT COURT FOR THE DISTRICT OF COLUMBIA ) SECURITIES AND EXCHANGE COMMISSION, ) 100 F Stre~t, N.E. ) Washington, D.C. 20549 ) Plaintiff, ) ) v. Civil Action No. ) ) ALLEN BARNETT ) THOMAS STINER, Case: 1:09-cv-00457 Assigned To : Sullivan, Emmet G. Defendants. Assign. Date: 3/9/2009 Description: General Civil COMPLAINT Plaintiff Securities and Exchange Commission (SEC or Commission) alleges as follows: SUMMARY 1. This case involves a fmancial and accounting fraud perpetuated...»

«Article The Relation Between Translation and Ideology as an Instrument for the Establishment of a National Literature Nüzhet Berrin Aksoy Meta : journal des traducteurs / Meta: Translators' Journal, vol. 55, n° 3, 2010, p. 438-455.Pour citer cet article, utiliser l'information suivante : URI: http://id.erudit.org/iderudit/045064ar DOI: 10.7202/045064ar Note : les règles d'écriture des références bibliographiques peuvent varier selon les différents domaines du savoir. Ce document est...»

«Puritan Bennett™ 980 Ventilator System User’s Pocket Guide Breathe More NATURALLY ©2012 Covidien. COVIDIEN, COVIDIEN with logo, Covidien logo and positive results for life are U.S. and internationally registered trademarks of Covidien AG. *Proportional Assist and PAV are registered trademarks of The University of Manitoba, Canada. Used under license. Other brands are trademarks of a Covidien company. The information contained in this pocket guide is the sole property of Covidien and may...»

«Clin. exp. Immunol. (1972) 10, 223-234. UNRELEASED INTRACELLULAR MONOCLONAL MACROGLOBULIN IN CHRONIC LYMPHOCYTIC LEUKAEMIA D. HUREZ,* G. FLANDRIN, J. L. PREUD'HOMME AND M. SELIGMANN Research Institute on Blood Diseases, HMpital Saint-Louis, Paris (Received 26 July 1971) SUMMARY In two patients presenting as chronic lymphocytic leukaemia, immunofluorescence studies have demonstrated the presence of an accumulated intracytoplasmic material reacting with antisera to,u chains and only to one type...»


«P  Noninterest Income and Financial Performance at U.S. Commercial Banks Robert DeYoung Tara Rice Emerging Issues Series Supervision and Regulation Department Federal Reserve Bank of Chicago August 2003 (S&R-2003-2) Noninterest Income and Financial Performance at U.S. Commercial Banks Robert DeYoung Federal Reserve Bank of Chicago Tara Rice Federal Reserve Bank of Chicago Forthcoming in The Financial Review. Abstract: Noninterest income now accounts for over...»

«The A merican in Ch arity: Benito C ereno an d Gothic A nti-Sentimentality Peter Coviello Studies in American Fiction Autumn 2002 [... ]In the essay that follows, I want to contend that the vehicle Melville fashions to carry this anger, this furious disma y before a direly inept and seeming ly ineducable Am erican readership, is the great novella of slave revolt Benito Cereno ; that its method is, in a complicated way, gothic; and that what not indirectly inspires and crystallizes this...»

«Volume 6(3): 559–571 Copyright © 1999 SAGE (London, Thousand Oaks, CA and New Delhi) Review Article: The Organization of Thought reviews section Richard Weiskopf and Hugh Willmott University of Innsbruck and Manchester School of Management Organizational Analysis as Deconstructive Practice, Robert Chia. Berlin/ New York: De Gruyter, 1996. Organization theories are academic products produced within the context of socially legitimized public institutions which are themselves effects of primary...»

«A Report to The Congress of the United States on The State of the Humanities and The Reauthorization of The National Endowment for the Humanities by The American Council of Learned Societies Copies may be obtained for $7.00 from THE AMERICAN COUNCIL OF LEARNED SOCIETIES 228 East 45th Street, New York, N.Y. 10017 Library of Congress Catalog Card Number 85-71938 Contents not copyright CONTENTS Page Introduction v THE REPORTS OF THE LEARNED SOCIETIES American Academy of Religion American...»

«AwoX SmartLIGHT™/ AwoX SmartLIGHT™Color Οδηγός χρήστη Руководство User Guide Mode d’emploi Käyttöopas пользователя Benutzerhandbuch Felhasználóiv útmutató Användarhandbo Guía del Usuario Brukerveiledning Navodila za uporabo Manuale d’istruzioni Podręcznik użytkownika Kullanıcı Kılavuzu Návod k použití Guia do utilizador Brugervejledning EN © 2014 AwoX. All rights reserved. AwoX SmartLIGHT, AwoX SmartLIGHT Color, AwoX, the AwoX logo and...»

«Pastor J. Harper, President of State Convention Apostle Walter Camp, 1st Vice President Minister Danny Current, Dean of Christian Education Pastor Michael Hansberry, Southern District Moderator Rice Memorial Missionary Baptist Church Apostle Walter F. Camp, Presiding Senior Pastor Pastor David K. Baker III, 2nd Presiding 802 W. 15th, Little Rock, AR 72202 Rice Memorial MBC www.RiceMemorialBaptistChurch.org Showing Humility Definitions in bold letters are from Strong’s Exhaustive Concordance...»

«REVISED INTERIM WRITTEN DESCRIPTION GUIDELINES TRAINING MATERIALS Contents Synopsis.. 4 Decision Trees Written Description Amended or New Claims or Claims Asserting the Benefit of an Earlier Filing Date. 6 Original Claims.. 7 Example 1: Amended claims. 10 Example 2: 35 USC 120 Priority. 13 Example 2A: Essential element missing from original claim. 15 Example 2B: A preferred element missing from original claim. 17 Example 3: New claims.. 19 Example 4: Original claim. 22 Example 5: Flow...»

<<  HOME   |    CONTACTS
2017 www.thesis.dislib.info - Online materials, documents

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.