Yahoo! Open Source Student Projects


Apache Pig Open Source Student Projects

We invite class projects that extend the Apache Pig data processing platform (http://incubator.apache.org/pig). The following is a list of project ideas, but feel free to use your imagination. Successful projects will be considered for incorporation into the Apache Pig code-base.

You may submit your project for consideration to the Apache community, via email at pig-dev@incubator.apache.org.

Project Suggestions

  • Embed Pig Latin into a Turing-complete scripting language such as Python.
  • Design and implement a compiler from (a subset of) SQL into Pig Latin. The compiler should perform some basic optimizations such as pushing projections ahead of joins.
  • Build an optimizing compiler for Pig Latin, perhaps incorporating some database query optimization techniques.
  • Implement fragment-and-replicate join in Pig: perform the join in Hadoop's "Map" phase by making a full copy of the smaller of the two inputs on every map node, building a lookup structure (e.g., a hash table) over it, and streaming the larger input through memory while probing the hash table for matches. Also, have Pig choose automatically at runtime between this join technique and the current "symmetric hashing"/"reduce-side join" technique, based on the sizes of the input files.
  • Implement adaptive query processing techniques in Pig (for an overview of adaptive query processing, see Babu & Bizarro, Adaptive Query Processing in the Looking Glass, Proceedings of CIDR 2005).
  • Add a new backend system to Pig (e.g., a Dryad-like system?) with a corresponding compiler to run Pig Latin on the new backend.
  • Exploit indexes in Pig: recognize and use indexes while executing FILTER and JOIN commands, in cases where the index is expected to speed up execution. (Perhaps HBase or HyperTable can be used as the indexing technology.)
  • Use data layout information for query optimization in Pig: if data is already partitioned or sorted on some attributes, perform subsequent grouping or joining of the data on those attributes without reorganization (e.g., a merge-join instead of the default symmetric hash-join in Pig).
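
As a starting point for the fragment-and-replicate join suggestion above, the following is a minimal in-memory sketch in Python rather than Pig or Hadoop code. The function names, the tuple-based record format, and the memory-budget threshold used to pick a strategy are all assumptions made for illustration; a real implementation would run inside Hadoop map tasks and would read the input sizes from the file system.

```python
from collections import defaultdict
from typing import Callable, Hashable, Iterable, Iterator, Tuple

Record = Tuple  # assumption: a record is a plain tuple in this sketch


def fragment_replicate_join(
    small: Iterable[Record],
    large: Iterable[Record],
    small_key: Callable[[Record], Hashable],
    large_key: Callable[[Record], Hashable],
) -> Iterator[Record]:
    """Join two inputs by hashing the small one and streaming the large one.

    On a cluster, `small` would be copied in full to every map node and
    `large` would be the fragment of the big input assigned to that node.
    """
    # Build the lookup structure over the replicated (smaller) input.
    table = defaultdict(list)
    for rec in small:
        table[small_key(rec)].append(rec)

    # Stream the larger input and probe the hash table for matches.
    for rec in large:
        for match in table.get(large_key(rec), ()):
            yield match + rec


def choose_join_strategy(
    left_bytes: int, right_bytes: int, memory_budget_bytes: int
) -> str:
    """Hypothetical size-based rule for picking a join technique.

    Assumption: the map-side join is only viable when the smaller
    input fits within a per-node memory budget.
    """
    if min(left_bytes, right_bytes) <= memory_budget_bytes:
        return "fragment-replicate"
    return "reduce-side"


if __name__ == "__main__":
    users = [(1, "ann"), (2, "bob")]
    clicks = [(1, "/home"), (3, "/faq"), (1, "/cart"), (2, "/home")]
    first_field = lambda rec: rec[0]
    for row in fragment_replicate_join(users, clicks, first_field, first_field):
        print(row)
```

Each output row is the matching record from the small input followed by the record from the large input; records in the large input with no match are dropped, as in an inner join. The strategy function only compares sizes against a fixed budget, so a project would likely need to refine it, for example to account for the in-memory size of the hash table.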

Pig Contact: Chris Olston - olston[at]yahoo-inc[dot com]

Apache Hadoop Open Source Student Projects

Apache Hadoop is an open source software project to enable data-intensive computing on large clusters. It includes a distributed file system (HDFS), programming support for MapReduce, and infrastructure software for grid computing. We are inviting class projects that extend this platform, and a list of project ideas is included below.

You may submit your project for consideration to the Apache community, via email at dev@hadoop.apache.org.

Project Suggestions

  • Modeling of block placement, replication and load balancing policies in HDFS
  • Modeling of the expected time to data loss for a given HDFS cluster, given Hadoop’s replication policy and protocols.
  • Prototyping approaches to scaling the HDFS namespace. Goals: keep it simple; preserve or increase metadata operations per second; support very large numbers of files (billions to trillions) and blocks
  • Mounted archives as namespace extensions, allowing transparent access to the contents of archives
  • Implementation of a storage-backed namenode where only part of the namespace is cached in memory
  • Integration of virtualization (such as Xen) with Hadoop tools
  • How does one integrate sandboxing of arbitrary user code in C++ and other languages in a VM such as Xen with the Hadoop framework? How does this interact with scheduling systems such as SGE, Torque, Condor?
  • Tools for discovering network topology and identifying and diagnosing hardware that is not functioning correctly
  • An improved framework for debugging and performance optimization
  • A distributed profiler for measuring distributed map-reduce applications. It should provide standard profiler features, e.g., the number of times a method is executed, its execution time, and the number of times a method caused some kind of failure, perhaps accumulated over all instances of the tasks that make up the application
  • Map-reduce enhancements that improve performance on the standard Hadoop sort benchmarks.
  • Sort and shuffle optimization in the MR framework. Some example directions:
    • Memory-based shuffling in the MR framework
    • Combining the results of several maps on a rack or node before the shuffle. This can reduce seek work and intermediate storage.
  • A framework for capturing workload statistics and replaying workload simulations to allow the assessment of framework improvements
  • Benchmark suite for Data Intensive Supercomputing: a suite of data-intensive supercomputing application benchmarks that would provide a target for optimizing Hadoop (and other map-reduce implementations)
  • Performance evaluation of existing Locality Sensitive Hashing schemes (http://en.wikipedia.org/wiki/Locality_sensitive_hashing), and research on new hashing schemes for filesystem namespace partitioning
  • Propose an API and sample use cases for a file as a repository of blocks, where a user can add and delete blocks at arbitrary positions in a file. This would allow holes in files and moving blocks from one file to another. How does this reconcile with the sequence-of-bytes view of a file? Such an approach may encourage new styles of applications. To push a bit more in a research direction: UNIX file systems are managed as a sequence of bytes but are usually (and in Hadoop's case exclusively) used as a sequence of records. If the file system participates in record management (as mainframes do, for example), you can get some nice semantic and performance improvements.