Hasso-Plattner-Institut25 Jahre HPI
Hasso-Plattner-Institut25 Jahre HPI
  • de

Jens Krüger

Enterprise-specific In-Memory Data Management: HYRISEc - An In-Memory Column Store Engine for OLXP

Enterprise applications are presently built on a 20-year-old data management infrastructure that was designed to meet a specific set of requirements for transactional systems. In the meantime, enterprise applications have become more sophisticated, data set sizes have increased, requirements on the freshness of data have been strengthened, and the time allotted for completing business processes has been reduced. To meet these challenges, enterprise applications have become increasingly complicated to make up for shortcomings in the data management infrastructure. These complications increase the total cost of ownership of the applications and make them harder to use. This thesis pursues the idea of designing an enterprise application-specific database engine, which is better optimized for the observed workload and data characteristics, while leveraging latest hardware trends and advances in data processing algorithms. As a result, the actual requirements, data characteristics, and as workloads from today's enterprise applications are extracted, a novel workload category called Online Mixed Workload Processing (OLXP) is defined and the enterprise application-specific database engine HYRISEc is presented. HYRISEc facilitates read-optimized in-memory data structures since today's database systems are designed for a more update intensive workload than they are actually facing. Traditional read-optimized databases use a dictionary encoded compressed column-oriented approach, especially in combination with an in-memory architecture. Inserting a tuple in such a compressed store is as complex as inserting a value in a sorted column, because the entire compression has to be rebuilt. Furthermore, traditional index structures cannot be applied efficiently as these databases are not based on a page structure.

To handle updates in a compressed storage efficiently, HYRISEc implements a technique called differential store, maintaining a small write-optimized delta partition that accumulates all updates. Periodically, this delta partition is combined with the read-optimized main partition. This merge process involves decompressing the compressed main partition, merging the delta and main partitions, and re-compressing the resulting main partition. For transactional enterprise applications it is crucial that their data is always available in a 24/7 environment and system downtimes are not allowed. As such, the merge process must be performed online and fast enough, so as not to degrade the update throughput. The update performance of such a system is limited by two factors: a) the insert rate for the write-optimized structure, and b) the speed with which the system can merge the accumulated updates back into the read-optimized partition, while keeping the system online without any downtime. The merge process becomes the main bottleneck for the system, and needs to be optimized by orders of magnitude to support fast updates. Consequently, a fast attribute merging algorithm is introduced that performs a linear-time update of the compressed main partition, and performs multi- core aware optimizations to exploit the underlying high compute and bandwidth resources of modern multi-core CPUs.

With regards to fast lookups, memory bandwidth and latency are limiting the execution speed of queries and, therefore, this scarce resource has to be used economically to maximize performance. Scanning a complete column results in the transfer of the entire column from memory to the processor and the costs depend linearly on the column length. A common approach to speed up the access to highly selective subsets of the data is to use indices which enable searches in logarithmical time. Thus, an index is introduced that leverages the proposed architecture. This includes the compressed column- oriented storage for the actual index data structures and the attribute merge algorithm for index maintenance.

To summarize, this thesis presents research results illustrating how an in-memory database engine can be implemented for OLXP by introducing data structures and algorithms to enable fast updates and lookups by leveraging the potential of a read-optimized store at the same time.