Our paper "BabelMR: A Polyglot Framework for Serverless MapReduce" by Fabian Mahling, Paul Rößler, Thomas Bodner, and Tilmann Rabl was accepted at SDA '23 (co-located with VLDB).
Abstract:
The MapReduce programming model and its open-source implementation Hadoop have democratized large-scale data processing by providing ease of use and scalability. Subsequently, systems such as Spark have dramatically improved efficiency. However, for a large number of users and applications, using these frameworks remains challenging, because the frameworks typically restrict users to specific programming languages or require cluster management expertise.
In this paper, we present BabelMR, a data processing framework that provides the MapReduce programming model to arbitrary containerized applications to be executed on serverless cloud infrastructure. Users provide application logic in Map and Reduce functions that read and write their inputs and outputs to the ephemeral filesystem of a serverless function container. BabelMR orchestrates the data-parallel programs across stages of concurrent cloud function executions and efficiently integrates with serverless storage systems and columnar storage formats. Our evaluation shows that BabelMR reduces the entry hurdle to analyzing data in a distributed serverless environment in terms of development effort. BabelMR's I/O and data shuffle building blocks outperform handwritten Python and C# code, and BabelMR is competitive with state-of-the-art serverless MapReduce systems.
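To illustrate the file-based programming model the abstract describes, here is a minimal word-count sketch in Python. It is not BabelMR's actual interface: the function names `map_words` and `reduce_counts`, the `part-<n>.tsv` file naming, the tab-separated format, and the CRC32-based partitioning are all assumptions made for illustration. The only idea taken from the abstract is that Map and Reduce logic reads its input from, and writes its output to, the local filesystem, leaving data movement between stages to the framework.

```python
import zlib
from collections import Counter
from pathlib import Path


def map_words(input_dir: Path, output_dir: Path, num_partitions: int) -> None:
    """Map: count words in all local input files, write one file per partition."""
    counts = Counter()
    for path in sorted(input_dir.iterdir()):
        if path.is_file():
            counts.update(path.read_text(encoding="utf-8").split())

    output_dir.mkdir(parents=True, exist_ok=True)
    partitions = [[] for _ in range(num_partitions)]
    for word, count in sorted(counts.items()):
        # Assumed partitioning scheme: crc32 is stable across processes,
        # unlike Python's built-in hash().
        index = zlib.crc32(word.encode("utf-8")) % num_partitions
        partitions[index].append(f"{word}\t{count}")
    for index, lines in enumerate(partitions):
        # Hypothetical file naming; a framework would ship these files
        # to the reducer responsible for each partition.
        (output_dir / f"part-{index}.tsv").write_text(
            "\n".join(lines), encoding="utf-8"
        )


def reduce_counts(input_dir: Path, output_file: Path) -> None:
    """Reduce: sum the partial counts in all local files of one partition."""
    totals = Counter()
    for path in sorted(input_dir.iterdir()):
        for line in path.read_text(encoding="utf-8").splitlines():
            word, count = line.split("\t")
            totals[word] += int(count)
    output_file.parent.mkdir(parents=True, exist_ok=True)
    output_file.write_text(
        "\n".join(f"{word}\t{count}" for word, count in sorted(totals.items())),
        encoding="utf-8",
    )
```

Because both functions touch only local paths, the same logic could be written in any language that can read and write files, which is the property that makes a polyglot, container-based design possible.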