We are creating infrastructure to support ad-hoc analysis of very large data sets, with parallel processing at its core. Our system, Pig, runs on a cluster computing architecture, on top of which sit several layers of abstraction that ultimately put the power of parallel computing in the hands of ordinary users. The layers in between automatically translate user queries into efficient parallel evaluation plans and orchestrate their execution on the raw cluster hardware.
The highest abstraction layer in Pig is a query language interface, whereby users express data analysis tasks as queries, in the style of SQL or relational algebra. Queries articulate data analysis tasks in terms of set-oriented transformations: for example, apply a function to every record in a set, or group records according to some criterion and apply a function to each group. Set-oriented transformations are inherently amenable to parallel evaluation, because the processing logic for each record (or group of records) is self-contained, and the order in which outputs are produced is immaterial. The layers between the query interface and the raw cluster hardware are responsible for planning and executing efficient parallel evaluation strategies for queries. In designing these intermediate layers, we focus on re-use of derived data, joint evaluation of multiple (sub)queries, and intelligent data placement and replication strategies.
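To make the two set-oriented transformations concrete, here is a minimal sketch in Python. It is not Pig's query language or its implementation. The helper names `parallel_foreach` and `parallel_group_apply` and the sample `visits` records are hypothetical, and a local thread pool stands in for a cluster. The point it illustrates is that each record, or each group, can be handed to a separate worker because its processing logic is self-contained.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def parallel_foreach(records, fn, workers=4):
    """Apply fn to every record. Each call depends only on its own record,
    so the calls can be distributed across workers in any order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, records))


def parallel_group_apply(records, key_fn, group_fn, workers=4):
    """Group records by key_fn, then apply group_fn to each group.
    Groups are independent of one another, so each can go to its own worker."""
    groups = defaultdict(list)
    for record in records:
        groups[key_fn(record)].append(record)

    def process(item):
        key, members = item
        return key, group_fn(members)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(process, groups.items()))


# Hypothetical sample data: (user, page_views) records.
visits = [("alice", 3), ("bob", 5), ("alice", 4), ("carol", 1), ("bob", 2)]

# Per-record transformation: double every page-view count.
doubled = parallel_foreach(visits, lambda r: (r[0], r[1] * 2))

# Per-group transformation: total page views for each user.
totals = parallel_group_apply(
    visits,
    key_fn=lambda r: r[0],
    group_fn=lambda members: sum(views for _, views in members),
)
```

A thread pool is used here only to keep the sketch self-contained. On a real cluster the grouping step itself would also be distributed, which this example does not attempt to show.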
Pig is an Apache incubator open-source project: http://incubator.apache.org/pig/
Pig Latin: A Not-So-Foreign Language for Data Processing
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. ACM SIGMOD 2008 International Conference on Management of Data, Vancouver, Canada, June 2008.
Apache Pig Open Source Student Projects
We invite class projects that extend the Apache Pig data processing platform. A full list of project ideas is available. Successful projects will be considered for incorporation into the Apache Pig code base.