Towards Cloud-based Enterprise Applications on Large Shared Datasets
Enterprises are moving their applications to public cloud environments to benefit from the resource elasticity and cost efficiency that these infrastructures provide. The resulting collocation of applications brings an opportunity to share application data within and across organizations for integrated analytics. Current database systems, however, offer neither the in-memory performance needed for real-time analytics on an organization's own data nor the elasticity needed for efficient ad-hoc combination with large external data. In this work, we discuss a database architecture that exploits modern cloud infrastructure to provide both. Specifically, we make three contributions:
First, we present a main-memory storage engine that is centered around a columnar, serialization-free, and interoperable data format. This storage engine enables efficient data exchange between application stacks by eliminating costly data transformations. Data can be accessed in place over RDMA or as database checkpoints on remote shared cloud storage. Performance for both transactional and analytical data processing is sustained through auxiliary data structures, such as MVCC version chains, indexes, and filters.
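To illustrate what "serialization-free" access can mean in practice, the following minimal C++ sketch reads a column directly from a shared chunk without copying or decoding it. The chunk layout, `ColumnView`, `open_column`, and `sum` are hypothetical names chosen for this illustration only and are not the data format described in this work.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical chunk layout, stored as 64-bit words (an assumption for this sketch):
//   word 0          : row count n
//   words 1 .. n    : values of column 0
//   words n+1 .. 2n : values of column 1
struct ColumnView {
    const std::int64_t* data;  // points into the chunk; nothing is copied
    std::size_t size;          // number of values in the column
};

// Opening a column is pure pointer arithmetic: no parsing, no deserialization.
inline ColumnView open_column(const std::vector<std::int64_t>& chunk, std::size_t col) {
    const std::size_t n = static_cast<std::size_t>(chunk[0]);
    return ColumnView{chunk.data() + 1 + col * n, n};
}

// An analytical read touches only the requested column.
inline std::int64_t sum(const ColumnView& column) {
    std::int64_t total = 0;
    for (std::size_t i = 0; i < column.size; ++i) {
        total += column.data[i];
    }
    return total;
}
```

Because the view only holds a pointer and a length, the same bytes could in principle be consumed unchanged whether they were mapped from a checkpoint or fetched from a remote buffer; auxiliary structures such as indexes are left out of this sketch.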
Second, we propose a massively parallel query engine that runs relational operators as short-lived, stateless cloud functions against shared storage. The cloud functions are specified in C++ to utilize scarce per-function resources efficiently, e.g., via SIMD capabilities and tailored memory management. The query engine thereby embraces the fine-grained compute resource consumption model of current cloud platforms.
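The idea of a stateless relational operator can be sketched as a pure function over one chunk of input whose partial results are merged afterwards. The example below is a simplified illustration under our own assumptions: `PartialAgg`, `filter_sum`, and `merge` are hypothetical names, and the actual cloud-function interface and storage access of the proposed engine are not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Partial result produced by one function invocation.
struct PartialAgg {
    std::int64_t sum;
    std::int64_t count;
};

// One invocation: filter rows with key >= threshold and aggregate their values.
// All inputs arrive as arguments and no state survives the call.
inline PartialAgg filter_sum(const std::vector<std::int64_t>& keys,
                             const std::vector<std::int64_t>& values,
                             std::int64_t threshold) {
    PartialAgg out{0, 0};
    const std::size_t n = keys.size() < values.size() ? keys.size() : values.size();
    for (std::size_t i = 0; i < n; ++i) {
        // Branch-free accumulation, a form that compilers can often auto-vectorize.
        const std::int64_t keep = keys[i] >= threshold ? 1 : 0;
        out.sum += keep * values[i];
        out.count += keep;
    }
    return out;
}

// Partial results from independent invocations combine associatively.
inline PartialAgg merge(PartialAgg a, PartialAgg b) {
    return PartialAgg{a.sum + b.sum, a.count + b.count};
}
```

Since each call depends only on its arguments, invocations over different chunks can run in parallel on separate short-lived workers and be combined in any order.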
Third, we design a query plan optimizer that targets our execution engine. It takes into account the particularities of the cloud platform resources, i.e., strict time and space limits on compute units and multi-tiered storage with various price-performance points. It further respects the cloud provider's resource pricing to produce fast yet budget-conscious query plans.
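The trade-off between speed and budget can be made concrete with a toy cost model. The sketch below is an assumption-laden illustration rather than the optimizer proposed here: `Platform`, `Plan`, `estimate`, and `choose_plan` are hypothetical, the model assumes perfectly parallel scans with a fixed startup delay, and it ignores storage tiers entirely.

```cpp
#include <cassert>

// Hypothetical platform parameters; units and values are illustrative only.
struct Platform {
    double startup_s;           // fixed startup delay per function
    double bytes_per_s;         // scan throughput of a single function
    double price_per_worker_s;  // price of one function running for one second
    double time_limit_s;        // hard per-function time limit
};

struct Plan {
    int workers;     // 0 means "no feasible plan"
    double seconds;  // estimated wall-clock duration
    double cost;     // estimated monetary cost
};

// Estimate duration and cost of scanning `bytes` with `workers` parallel functions.
inline Plan estimate(const Platform& p, double bytes, int workers) {
    const double seconds = p.startup_s + bytes / (workers * p.bytes_per_s);
    return Plan{workers, seconds, workers * seconds * p.price_per_worker_s};
}

// Among power-of-two degrees of parallelism, pick the fastest plan that
// respects both the per-function time limit and the user's budget.
inline Plan choose_plan(const Platform& p, double bytes, double budget, int max_workers) {
    Plan best{0, 0.0, 0.0};
    for (int w = 1; w <= max_workers; w *= 2) {
        const Plan candidate = estimate(p, bytes, w);
        if (candidate.seconds > p.time_limit_s || candidate.cost > budget) {
            continue;  // violates a platform limit or the budget
        }
        if (best.workers == 0 || candidate.seconds < best.seconds) {
            best = candidate;
        }
    }
    return best;
}
```

In this model, adding workers shortens the query but multiplies the startup overhead that is paid per function, so the budget caps the degree of parallelism while the time limit sets its lower bound.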