Jorge Quiané-Ruiz presents "Let's Stop Hurting the Poor Yellow Elephant!"

Title: "Let's Stop Hurting the Poor Yellow Elephant!"
Jorge Quiané-Ruiz

ABSTRACT

Nowadays, there exists a large variety of frameworks for data storage and computation, such as Hadoop, Pig, Hive, Scope, Dryad, Shark, and Spark. These frameworks typically consist of a distributed file system (such as HDFS) coupled with some sort of computation framework (such as Hadoop MapReduce). As these frameworks become popular, interest in improving their performance for specific applications also increases. For example, Hadoop has been the focus of a large number of optimizations at the HDFS and MapReduce framework levels.

Unfortunately, all these research works come at a cost: they are usually done via deep changes to the framework itself, which causes several problems. First, this leads to code bifurcation, and hence it is hard (or even impossible) for Hadoop users to try out new features outside the codebase. Second, each research work focuses on a single aspect of the distributed file system or computation framework, so it is hard for Hadoop users to benefit from all these optimizations in a single system. Third, supporting new features inside such frameworks is a tedious and expensive task, which makes it hard to try out new ideas.

In this talk, I will present Oryx, a powerful and general data processing system where User Defined Functions (UDFs) are first-class citizens. Oryx is designed to run on top of any existing distributed file system and computation framework. As a result, Oryx allows users to implement ad hoc solutions that efficiently support specific applications, without requiring them to change the underlying frameworks.

In particular, I will present Cartilage, an instance of Oryx for providing flexible big data storage on top of HDFS. Besides discussing how to implement existing data storage techniques (such as indexing and data layouts) in Cartilage, I will present completely new data storage schemes that are enabled by Cartilage and were not possible before. Additionally, I will present Violet, another instance of Oryx for supporting scalable violation detection (a crucial task in data cleaning) on top of Spark. I will show the huge benefits, in terms of scalability and performance, of abstracting violation detection as an Oryx instance.

BIOGRAPHICAL NOTE

Jorge has been a Scientist at the Qatar Computing Research Institute (QCRI) since October 2012. His current research interests include databases, distributed data management, and big data analytics. Before joining QCRI, Jorge was a research associate at Saarland University for three years. At Saarland, Jorge mainly worked on data management in MapReduce, where he developed Hadoop++ and HAIL, among others. He did his Ph.D. in Computer Science at INRIA in Nantes, France, under the supervision of Patrick Valduriez.