Actor Database Systems

Data-intensive applications often utilize scalable data tier solutions to handle their load. Sometimes, these data tier solutions are simple key-value store services that partition and distribute the data across several nodes; although this approach offers scalable reads and writes, it still leaves the data processing complexity, e.g., joins, transactions, and constraint checking, to the middle layer of the application. Other data tier solutions are full-featured database management systems with built-in support for declarative querying, transactions, stored procedures, integrity constraints, and so forth; they usually offer data sharding and replication features for scale-out, but their monolithic architecture makes it hard to implement and maintain such features.

The Actor Model is a design pattern for distributed applications, i.e., applications that are supposed to run distributed on several nodes in a computer network. It has gained a lot of popularity in the past years, because workload parallelization and distribution became the driving forces to make modern applications faster. Code written in the actor model can, however, not only be scaled-out to different machines more easily, it is also easier to maintain and enforces a more natural system architecture. For this reason, recent research projects revisited the architecture of modern database management systems and proposed to re-implement them in the Actor Model.

In this seminar, we build our own database on top of the Actor Model. The challenge for this project is that all data – whether on disk or in main memory – must be stored as private state of some actor and each actor can only access its own state. Assume, for instance, that there is one actor for each table in the database. All requests to that table are handled by this actor (or its child actors). We then encounter the following (and further) challenges:

Multi-table operations: Queries that use operations, such as joins and intersections, require actors to combine their data views. The actors basically need to find the solution for such query operations via messaging and without exchanging their entire data.
Partitioning: The actor model demands that the data is split into partitions that are each owned by a different actor. Once partitioned, the data can easily be distributed with their respective actors, but the partitioning strategy can be arbitrarily complex, e.g., from “each table is owned by one actor” to “each block of 64 MB is owned by one actor”.
Replication: For fault tolerance and further performance improvements, one could consider replication of partitions. In this case, the actor database system requires additional replication strategies (may they be leader-less or leader-based) as well as consensus mechanisms.
Constraints: Integrity constraints, such as foreign-keys, span across different tables, which are – in our case – hold by different actors. Ensuring that these constraints are met with every write operation to each of the tables is a challenge and might require additional actor communication.
Transactions: A transaction is a sequence of data query and/or manipulation operations that should be consistently answered and/or applied. Since this usually involves different, independent actors, actor databases need their own strategies to successfully process transactions.
Asynchronous actor behavior: Actors communicate asynchronously but the outside world might work synchronously. Hence, at some stage, the actor database system needs to adapt to this outside behavior.

The goal of this seminar is to develop an actor database system prototype that offers a simple SQL-like API for basic CRUD functions on relational tables and their tuples. We will explore the different challenges for this new architecture and propose prototypical solutions. In the end, we evaluate our actor database system against some monolithic database management system.

For this seminar, participants require the following prerequisites (or must catch up on these at the beginning of the seminar):

Database knowledge (ideally Database System I and Database Systems II)
Akka knowledge (ideally Distributed Data Analytics)
Java or Scala knowledge

Organisation

Extent: 4 SWS
Location: G-3.E.11, Campus III
Dates: Monday, 11:00 - 12:30
Maximal 6 participants
The first date serves as an introduction to the topic and the seminar. Subsequently, you can register for the course through an informal email by April 13 to Thorsten Papenbrock. In case of more than six registrations, we have to pick slots randomly.
Your tasks
- Active participation at all meetings
- Implementation of the actor database system in teams of two students
- A midterm and a final presentation
- A written summary/documentation of your solution: 2-6 pages per person accoding to ACM template