Generating Example Data for Dataflow Programs

    Publication properties
    Title: Generating Example Data for Dataflow Programs
    Date: 2009
    Publication type: Conference paper
    Authors:
    1. Christopher Olston
    2. Shubham Chopra
    3. Utkarsh Srivastava
    Download (by DOI): 10.1145/1559845.1559873
    BibTeX: conf/sigmod/OlstonCS09
    DBLP: db/conf/sigmod/sigmod2009.html#OlstonCS09

    Conference Track
    Conference Name: ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, Providence, Rhode Island, USA, June 29 - July 2, 2009
    Track Name: Research
    URL: http://www.sigmod09.org/

    Abstract

    While developing data-centric programs, users often run (portions of) their programs over real data, to see how they behave and what the output looks like. Doing so makes it easier to formulate, understand and compose programs correctly, compared with examination of program logic alone. For large input data sets, these experimental runs can be time-consuming and inefficient. Unfortunately, sampling the input data does not always work well, because selective operations such as filter and join can lead to empty results over sampled inputs, and unless certain indexes are present there is no way to generate biased samples efficiently. Consequently new methods are needed for generating example input data for data-centric programs.

    We focus on an important category of data-centric programs, dataflow programs, which are best illustrated by displaying the series of intermediate data tables that occur between each pair of operations. We introduce and study the problem of generating example intermediate data for dataflow programs, in a manner that illustrates the semantics of the operators while keeping the example data small. We identify two major obstacles that impede naive approaches, namely (1) highly selective operators and (2) noninvertible operators, and offer techniques for dealing with these obstacles. Our techniques perform well on real dataflow programs in use at a major search engine company for web analytics.
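
The abstract's remark that sampling can produce empty results after a selective join can be illustrated with a small sketch. This is not the paper's technique; it only demonstrates the motivating problem. The table names (`users`, `visits`), the table size, the sampling rate, and the helper functions are all hypothetical choices made for illustration.

```python
import random


def sample(rows, rate, rng):
    """Keep each row independently with probability `rate` (uniform sampling)."""
    return [row for row in rows if rng.random() < rate]


def join(left, right):
    """Equi-join two lists of (key, value) pairs on the key."""
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, [])]


def run_demo(n=10_000, rate=0.01, seed=0):
    """Compare a join over full inputs with a join over independent samples."""
    rng = random.Random(seed)
    # Hypothetical inputs: every user has exactly one matching visit row.
    users = [(i, f"user{i}") for i in range(n)]
    visits = [(i, f"url{i}") for i in range(n)]

    full = join(users, visits)
    sampled = join(sample(users, rate, rng), sample(visits, rate, rng))

    # A matching pair survives only if both sides are sampled, so the
    # expected number of joined rows shrinks by a factor of rate * rate.
    expected = n * rate * rate
    return len(full), len(sampled), expected


if __name__ == "__main__":
    full_rows, sampled_rows, expected_rows = run_demo()
    print(f"full join: {full_rows} rows")
    print(f"join over samples: {sampled_rows} rows (expected about {expected_rows:.1f})")
```

Under these assumptions, a 1% sample of each input is expected to leave about one joined row out of 10,000, and many runs leave none. That is the situation the abstract describes for selective operators over sampled inputs.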