About the Talk
Summary written by Nuri Mazouz & Nico Matzke
Recent research introduces more efficient storage formats and improves compression rates. However, these improvements rarely catch on because of a chicken-and-egg problem: systems do not implement a new format because it is not widespread enough to justify the implementation effort, and the format does not become widespread because systems do not implement it. This talk combined two recent research works in data management: "AnyBlox: Portable and Extensible Data Access for Data Processing Systems" and "Towards Designing Future-Proof Data Processing Systems". Together they address the central question of how data-processing systems can remain both up-to-date and efficient in the face of changing requirements, different data formats, and an evolving hardware landscape. This summary covers two main aspects: (1) the challenge of extensible and secure data access that is still performant and portable, and (2) the broader vision of a unified, adaptable architecture for future data-processing systems.
Designing Future-Proof Data Processing Systems
Background
Modern data-processing systems face a range of fundamental challenges. One key problem lies in continuously changing user requirements. New application domains demand support for additional data types and processing capabilities, such as string functions like Jaccard similarity, graph data, or complex user-defined operations. At the same time, new input formats appear frequently, often specific to individual domains such as genomics or sensor data.
Another source of change is the computing environment itself. Hardware developments such as GPUs, DPUs, high-bandwidth memory, and non-volatile RAM, as well as the increased use of cloud-based storage, alter the assumptions under which traditional systems were designed. As a result, data-processing systems today must be extensible and flexible in addition to being high-performing and maintainable if they are to remain usable in the long term. These goals often conflict with each other: for example, every extension added for flexibility must itself be kept up-to-date, which adds to the maintenance burden.
Another dimension of the problem concerns data storage. While classical database systems operated on centralized and normalized data, modern data architectures increasingly rely on data lakes and object stores. These environments contain data in a variety of formats such as JSON, CSV, Parquet, or domain-specific representations. Systems must therefore be capable of decoding and interpreting all of these formats efficiently.
This situation leads to what the talk described as the N x M problem, likened to a similar issue in compiler research: if there are N data-processing systems and M data formats, then in principle N x M different readers or decoders must be implemented. Such an approach does not scale and hinders extensibility and maintainability, as seen historically in compiler development. Consequently, a suitable abstraction between data-processing systems and storage formats becomes a central requirement for future-proof architectures.
Aspects
Data Access and the AnyBlox Approach
A central question in designing future-proof data-processing systems is what kind of abstraction can mediate between data-processing engines and the storage formats they access. This abstraction should ideally be portable, secure, performant, and extensible. Existing mechanisms such as native extensions, external processes, or eBPF-based approaches offer partial solutions but fail to meet all these requirements simultaneously.
The AnyBlox framework introduces a new design principle to address this limitation: query engines should not read data formats, but instead formats should "read themselves". In practice, this means that each format is distributed together with a corresponding decoder implemented in WebAssembly (WASM).
The overall architecture consists of three components: the host system (e.g., a query engine), the AnyBlox layer, and the decoder. When a query accesses data, the host system interacts with the AnyBlox runtime, which in turn executes the WASM decoder responsible for interpreting the data. The data itself is memory-mapped in read-only mode, and the decoder operates on this memory region within a sandboxed runtime.
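The read-only mapping described above can be illustrated with a minimal Python sketch. It is not AnyBlox code: the toy file layout (a count followed by 32-bit integers) and the names `write_toy_file`, `decode_toy`, and `read_column` are assumptions made for illustration, and a plain Python function stands in for the sandboxed WASM decoder.

```python
import mmap
import os
import struct
import tempfile


def write_toy_file(path, values):
    """Write a hypothetical toy format: a uint32 count followed by int32 values."""
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(values)))
        f.write(struct.pack(f"<{len(values)}i", *values))


def decode_toy(buf):
    """Stand-in for a decoder: it only reads from the buffer it is given."""
    (count,) = struct.unpack_from("<I", buf, 0)
    return list(struct.unpack_from(f"<{count}i", buf, 4))


def read_column(path):
    """Host side: map the file read-only and hand the mapping to the decoder."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return decode_toy(mapped)


def demo():
    """Round-trip a small column through a temporary file."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_toy_file(path, [3, 1, 4, 1, 5])
        return read_column(path)
    finally:
        os.remove(path)
```

Because the mapping is opened with `mmap.ACCESS_READ`, the decoder can inspect the bytes but cannot modify the underlying file, which mirrors the division of responsibilities between host and decoder in the description above.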
Decoding results are then returned in the standardized Apache Arrow representation, which can be directly used in existing execution environments without reshaping data. The approach reduces the complexity of integration: instead of N x M implementations, only N + M components are needed, since data-processing systems and decoders can be developed independently.
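The N + M argument can be made concrete with a small sketch. Everything in it is hypothetical: a dictionary of column lists stands in for an Arrow record batch (it is not the Arrow API), and the decoder and engine names are invented for illustration. The point is only that each decoder and each engine is written once against the shared representation.

```python
import json
from typing import Callable, Dict, List

# Simplified stand-in for a columnar batch: column name -> list of values.
Batch = Dict[str, List[int]]


def decode_csv_ints(raw: bytes) -> Batch:
    """Hypothetical decoder: header line plus comma-separated integer rows."""
    lines = raw.decode("utf-8").strip().splitlines()
    names = lines[0].split(",")
    columns: Batch = {name: [] for name in names}
    for line in lines[1:]:
        for name, cell in zip(names, line.split(",")):
            columns[name].append(int(cell))
    return columns


def decode_json_ints(raw: bytes) -> Batch:
    """Hypothetical decoder: a JSON object mapping column names to integer lists."""
    return {name: list(values) for name, values in json.loads(raw).items()}


# M decoders, each written once against the shared batch representation.
DECODERS: Dict[str, Callable[[bytes], Batch]] = {
    "csv": decode_csv_ints,
    "json": decode_json_ints,
}

# N engines, each written once against the same representation.
ENGINES: Dict[str, Callable[[Batch, str], int]] = {
    "sum": lambda batch, column: sum(batch[column]),
    "max": lambda batch, column: max(batch[column]),
    "min": lambda batch, column: min(batch[column]),
}


def run(engine: str, fmt: str, raw: bytes, column: str) -> int:
    """Any engine can consume any format without format-specific code."""
    return ENGINES[engine](DECODERS[fmt](raw), column)


def component_counts(n_systems: int, n_formats: int):
    """Return (pairwise readers, decoupled components) for N systems and M formats."""
    return n_systems * n_formats, n_systems + n_formats
```

For instance, with 5 systems and 8 formats, `component_counts(5, 8)` gives 40 pairwise readers versus 13 independently developed components.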
Experimental results show that this method can achieve performance comparable to native format support in established systems such as DuckDB, DataFusion, and Vortex. Moreover, the design ensures a high degree of portability and security, since WebAssembly provides sandboxed execution and controlled memory access.
Beyond Data Access: Towards Unified Architectures
While AnyBlox primarily focuses on the interface between data and system, the second research work discussed in the lecture proposes a broader conceptual framework for designing future-proof data-processing systems. The central idea is to unify the diverse modes of interaction with data, ranging from imperative code in languages like Python to declarative SQL queries, through a common intermediate layer.
This layer, referred to as a hinge layer, exposes a set of basic operators that serve as a common intermediate representation (IR). Different front-ends, such as SQL, user-defined operators, or low-level imperative programs, can then be compiled into this IR. Based on this representation, a runtime system constructs a dependency graph that captures the relationships between logical states (such as relations or hash maps) and computational tasks.
Once this dependency graph is established, the runtime can then make informed scheduling decisions, optimize data movement, and adapt execution strategies to the available hardware. This architecture allows the system to express not only which computations need to be performed, but also the structural properties and dependencies of the data being processed.
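As a rough illustration of how a dependency graph between states and tasks can drive scheduling, the following sketch derives task-to-task edges from the states each task reads and writes, then groups the tasks into waves that could run concurrently. The task and state names are hypothetical, and the standard-library `graphlib.TopologicalSorter` (Python 3.9+) stands in for the runtime's scheduler; this is not the system presented in the talk.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: name -> (logical states read, logical states written).
TASKS = {
    "scan_orders": ((), ("orders",)),
    "scan_customers": ((), ("customers",)),
    "build_hash_map": (("customers",), ("customer_map",)),
    "probe_join": (("orders", "customer_map"), ("joined",)),
    "aggregate": (("joined",), ("result",)),
}


def build_dependencies(tasks):
    """Derive task-to-task edges from the states each task reads and writes."""
    producer = {}
    for name, (_, writes) in tasks.items():
        for state in writes:
            producer[state] = name
    # A task depends on the producer of every state it reads.
    return {
        name: {producer[state] for state in reads}
        for name, (reads, _) in tasks.items()
    }


def schedule_in_waves(tasks):
    """Group tasks into waves; all dependencies of a wave are already done."""
    sorter = TopologicalSorter(build_dependencies(tasks))
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

In this toy plan the two scans form the first wave because neither depends on the other, while the hash-map build, the join probe, and the aggregation follow one after another. A real runtime would additionally weigh data movement and hardware placement when choosing among ready tasks.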
Summary
The long-term goal of this project is to build a unified runtime environment capable of dynamically adapting to new hardware configurations, data representations, and usage patterns. The vision presented by Prof. Dr. Jana Giceva seems quite ambitious: it combines concepts from compiler design, database systems, and runtime scheduling into an integrated framework. Work on this approach is still ongoing, particularly regarding the extraction of the intermediate representation from different programming interfaces and the integration of AnyBlox as a data-access layer. According to Prof. Dr. Jana Giceva, realizing this vision will require at least several years of research by a fairly large team as well as close collaboration across the specialized domains involved.
References