Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

SAP HANA Software Development Process

Alexander Böhm, SAP SE

Abstract

In his talk, Dr. Alexander Böhm gives an overview of the processes that the development of large, highly complex database management systems requires. After a brief summary of SAP HANA's characteristics as an in-memory column-store database, a multifaceted view of the development process is presented. This includes the challenges posed by diverging requirements on the software, the tooling and development environments used for large C++ projects, the principles of software development that are employed, such as test-driven development and continuous integration and delivery, as well as a deep dive into advanced testing concepts and how the team manages the exploding growth of testing efforts.

Biography

Dr. Alexander Böhm is responsible for DBMS architecture management. With the company since 2010, his focus lies on performance aspects and on driving strategic, technical, and operational optimization projects. His role also includes steering HANA development with respect to novel hardware and technology. Before joining SAP, he received his PhD from the University of Mannheim, working on distributed message processing systems.

A recording of the presentation is available on Tele-Task.

Summary

SAP HANA: A column-store database for hybrid workloads

SAP HANA is one of the first commercially adopted in-memory column-store transactional databases. Around 2010, its development was prompted by promising results of research into DBMSs suited for hybrid transactional and analytical processing (HTAP systems). This research was motivated by the hardware trends of the time, especially the increased availability and falling cost of main memory and the shift to parallel processing on multi-core systems. It became apparent that many data workloads could soon be handled entirely in memory.

The departure from traditional disk-based storage engines invited a reconsideration of the usual row-wise layout, in which individual records are stored contiguously. This layout facilitates online transactional processing (OLTP), such as INSERT/UPDATE operations and the materialization of rows, for which the number of reads and writes can be kept low. It is, however, not ideal for analytical queries, e.g. dynamic aggregation or full table scans on columns without an index. For such operations, which logically operate on columns, a row-wise layout causes unnecessary data accesses because data is loaded into the CPU in blocks of entire rows.

Such online analytical processing (OLAP) workloads greatly benefit from a column-wise layout. It maximizes throughput to the CPU for sequential accesses to column elements and enables efficient use of hardware caches, compression techniques, and vectorized processing with SIMD instructions. Transactional processing, in turn, is complicated by this data layout, as each record has to be written to multiple locations. More information about this approach can be found in [0].
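
To make the layout difference concrete, the following minimal sketch (illustrative only, not HANA code; the sales table and the `total_amount` aggregation are hypothetical) contrasts a row-wise and a column-wise representation of the same data. The aggregation only needs the amount column, which the column layout stores as one dense, scan-friendly array:

```cpp
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

// Row-wise layout: each record is stored contiguously.
struct SalesRow {
    std::int64_t order_id;
    std::int64_t customer_id;
    double       amount;
};

// Column-wise layout: each attribute is stored contiguously.
struct SalesColumns {
    std::vector<std::int64_t> order_id;
    std::vector<std::int64_t> customer_id;
    std::vector<double>       amount;
};

// OLAP-style aggregation on the row store: every complete record is loaded
// into the cache even though only 'amount' is needed.
double total_amount(const std::vector<SalesRow>& rows) {
    double sum = 0.0;
    for (const auto& r : rows) sum += r.amount;  // touches whole rows
    return sum;
}

// The same aggregation on the column store scans one dense array, which is
// cache-friendly, compresses well, and is easy to vectorize with SIMD.
double total_amount(const SalesColumns& cols) {
    return std::accumulate(cols.amount.begin(), cols.amount.end(), 0.0);
}

int main() {
    std::vector<SalesRow> rows = {{1, 42, 9.99}, {2, 7, 20.00}};
    SalesColumns cols{{1, 2}, {42, 7}, {9.99, 20.00}};
    std::cout << total_amount(rows) << " " << total_amount(cols) << "\n";
}
```

Conversely, inserting a new record into the column store requires appending to every attribute vector, which illustrates why transactional processing becomes more involved with this layout.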

Traditionally, separate database systems would therefore be dedicated to either OLTP or OLAP, with periodic ETL processes duplicating data into the analytical systems so that the transactional systems could run unhindered by analytical queries. HTAP database management systems aim to overcome this redundancy by providing sufficiently fast transactional processing and high-performance analytical processing on the same column-store data structures, which are kept in main memory.

Team Setup

The HANA development team is distributed over six international sites, centered around three main hubs in Canada, Germany, and Korea, where research is conducted in addition to feature development. Roughly 1,100 employees are part of the team.

The key skills of the developers include performance-aware programming in C and C++, occasionally assembly, and familiarity with modern software development best practices. Expertise in operating systems, hardware, and advanced database technology is essential as well.

SAP HANA Development Process

Customers heavily rely on a database as the source of ground truth at the core of every enterprise application. To live up to this trust, it must be highly stable and reliable in order to avoid downtime or loss of data. Enterprise use also mandates capabilities such as encryption, auditing, and backup and recovery. In the case of SAP HANA, the performance of the database is a further crucial feature that distinguishes it from competing products. Despite these strict requirements, the development team must also be able to react to feature requests to increase the product's usefulness for customers and therefore its adoption. To ship updates quickly, new on-premise versions are released yearly and cloud versions monthly.

This forces the team to strike a careful balance between a high development pace to meet feature pressure and the delivery of a highly stable, regression-free, complex DBMS. In the following, we detail the techniques and technologies used to reconcile these conflicting goals:

Tooling and Development Environment

The codebase of the HANA DBMS now encompasses more than 11 million lines of C++ code. Builds are released exclusively for Linux-based systems, so no deviation from commonly used build tools such as CMake and gcc is necessary. For local development, team members are free to choose their environment, i.e. they are not required to use a particular IDE, git client, etc. and can pick whatever they are most comfortable with. Version control is handled with git, another industry standard. Project management with all its tasks is done via JIRA, bug tracking via Bugzilla, and documentation lives in a correspondingly large wiki.

The team also uses more specific tools that are especially useful in the context of database development. For testing purposes, Google Test is used, a framework common in C++ projects. Performance benchmarking and regression analysis are conducted with Intel VTune, which helps to find bottlenecks. Most interestingly, the team utilizes a reverse debugger from Undo; open-source alternatives include rr by Mozilla. This type of debugger runs an 'observed' execution of the program and records the machine state for every instruction. From the resulting trace file, program execution can be stepped backwards after the fact, which allows the examination of hard-to-reproduce bugs. This is especially useful for complex systems such as databases, where, for example, concurrency and race conditions can lead to problems.

Processes of Software Development

As a strawman for a traditional mode of development, Dr. Böhm examines the waterfall model. Its sequential process consists of distinct phases that are completed one after the other: requirement analysis, system design, implementation, system testing, system deployment, and system maintenance. This model leads to severe problems when the requirements for the software change: since the requirements are fixed at the beginning, they cannot be revised in a later phase, which is exactly what incorporating feedback about the software would require. The waterfall model is therefore very inflexible and unsuitable for SAP's HANA development.

Instead, the concept of continuous integration and delivery (CI/CD) is used. Each developer can integrate code at any time and contribute to the mainline of the code, with automated building and testing ensuring a baseline of confidence in the software's correctness. Frequent integration of small changes allows customers to see whether the features are what they want in their product; they can test them and provide feedback even before the changes are final. An important backbone of this strategy is test-driven development (TDD).

Test-driven development (TDD)

TDD is one of the main ideas to ensure stable and reliable products. The implementation of a feature starts by writing one or more unit tests that check the newly required functionality. The developer can then concentrate on writing code until the tests pass. This supports quick development, as manual checking of functionality is reduced, and encourages adding only the code that is actually needed. As a major side effect, each feature is at least minimally tested before it is integrated into the software. Once the tests pass, efforts can shift to refactoring the code and improving the implementation.
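
As a minimal illustration of the test-first workflow (a generic sketch using Google Test, not actual HANA code; the `parse_port` helper is hypothetical), the tests below are written first and fail until the function behaves as specified; the implementation shown is what a developer might end up with once they pass:

```cpp
#include <gtest/gtest.h>

#include <cstddef>
#include <optional>
#include <string>

// Hypothetical helper under test: parses a TCP port from a string and
// returns std::nullopt for anything outside the range 1..65535.
std::optional<int> parse_port(const std::string& s) {
    try {
        std::size_t pos = 0;
        int value = std::stoi(s, &pos);
        if (pos != s.size() || value < 1 || value > 65535) return std::nullopt;
        return value;
    } catch (...) {
        return std::nullopt;
    }
}

// In TDD, these tests exist before the implementation and initially fail;
// only then is the code above filled in and later refactored.
TEST(ParsePortTest, AcceptsValidPort) {
    EXPECT_EQ(parse_port("30015"), 30015);
}

TEST(ParsePortTest, RejectsGarbageAndOutOfRangeValues) {
    EXPECT_EQ(parse_port("abc"), std::nullopt);
    EXPECT_EQ(parse_port("70000"), std::nullopt);
    EXPECT_EQ(parse_port(""), std::nullopt);
}
```

Linked against gtest_main, these tests run without an explicit main() and give immediate feedback on whether the new functionality behaves as intended.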

Tests exist on different levels of granularity, serving different purposes:

Unit tests

The majority of available tests; they are fine-grained checks of low-level functionality which should be quick to write and execute. They serve as the basis of TDD.

Component tests

Ensure the correctness of software subsystems, e.g. the transaction manager of the database. They can be added once a feature is thoroughly reviewed.

End-to-end tests

Test the behaviour of the software in its entirety, i.e. the complete DBMS. These tests check functional as well as non-functional requirements and prevent major failures.

While developing a feature, the unit tests for the updated parts are executed locally to detect most errors at an early stage. The larger the scope of a test, the less frequently it is executed in the automated pipelines. Thorough testing is a precondition for successfully employing a CI/CD approach.

Continuous Integration Practice

After a developer finishes local development via TDD, the changes enter the CI/CD pipeline. To manage the complexity of several hundred contributing developers, changes by individuals are not added directly to a central development branch, but instead to one of over a hundred topic branches that bundle related development activity. There, the changes are finalized in collaboration with team members who have related expertise. This includes code review by topic experts or senior members, as well as automated component testing. Once the team is confident in the update, it gets merged into a release branch; teams are encouraged to integrate their topic branch into the release branch roughly once a week. These proposed merges also trigger the rest of the CI pipeline with more complex and holistic tests. The final merge happens only if no failures are reported during these comprehensive tests.

To illustrate the sheer size of SAP HANA and the importance of this process for the software team, Dr. Böhm presented some key figures regarding the DBMS:

- Per day, there are roughly 30 changes to the mainline code, stemming from ca. 1,500 individual commits.
- Full testing of a build incorporates 1.2 million individual tests.
- Sequential execution of these tests would take over three weeks.
- There are over 200 performance test suites with ca. 35,000 KPIs.

In contrast to many other database development efforts, the performance tests are triggered by updates and not only run on a fixed schedule. This is necessary because performance is an integral part of the product, and with the high volume of changes, regressions might be identified too late under purely scheduled testing. The automated evaluation of the various KPIs, however, remains a major challenge for the team.
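
A very simple form of automated KPI evaluation is to compare each measurement against a stored baseline with a tolerance band, as in the hypothetical sketch below (not HANA's actual mechanism). In practice, run-to-run noise across 35,000 KPIs makes such single-threshold checks prone to false alarms, which is part of the challenge mentioned above:

```cpp
#include <iostream>

// Hypothetical KPI check: flag a regression when the measured value is more
// than 'tolerance' (e.g. 5%) worse than the stored baseline. A real pipeline
// has to cope with measurement noise across tens of thousands of KPIs, where
// a single-threshold check like this quickly produces false alarms.
bool is_regression(double baseline, double measured, double tolerance = 0.05) {
    // Assumes a "higher is better" KPI such as throughput (queries/second).
    return measured < baseline * (1.0 - tolerance);
}

int main() {
    double baseline_qps = 120000.0;  // stored reference measurement
    double measured_qps = 111000.0;  // current run
    std::cout << (is_regression(baseline_qps, measured_qps) ? "REGRESSION" : "OK")
              << '\n';
}
```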

In addition to these performance safeguards, the CI process contains comprehensive checks of code quality. This includes strict code coverage thresholds and a zero-tolerance policy for -Wall warnings and for critical issues reported by static security and quality scanners. Sub-teams are also free to impose further quality measures, such as additional levels of compiler warnings that must not occur.

Malfunction Testing

As a motivation for malfunction tests, Dr. Böhm mentions the case of SQLite, a thoroughly tested and highly engineered database project. The SQLite team reached 100% branch coverage; nevertheless, in a test scenario using a fuzzer that feeds random data into the system, multiple errors were easily discovered. This showcases the false sense of security that purely KPI-driven testing can create. Therefore, the HANA team incorporates malfunction tests into its automated testing, e.g. by forcing out-of-memory situations through HANA's memory management subsystem or by hiding hardware components at run-time. There are also libraries to inject failures into POSIX system calls. In HANA's case, these tests have uncovered missing timeout handling and failure routines. Other types of hardware and software failures, such as RAID controller or memory chip failures, or dropped network packets, are much harder to test throughout the development process. In the literature, this idea is known as chaos engineering; in a cloud scenario, it can extend to simulating outages of containers, servers, or even entire data centers.
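
The following is a minimal sketch of the fault-injection idea, assuming a hand-rolled allocation hook rather than HANA's actual memory management subsystem or a POSIX interception library: the hook can be told to fail after a configurable number of allocations, so that out-of-memory handling paths are exercised deterministically in a test:

```cpp
#include <cstddef>
#include <iostream>
#include <new>
#include <string>

// Hypothetical fault-injection hook: starts failing after a configurable
// number of successful allocations to simulate out-of-memory situations.
namespace faultinject {
int allocations_until_failure = -1;  // -1 means "never fail"

void* allocate(std::size_t bytes) {
    if (allocations_until_failure == 0) {
        throw std::bad_alloc();  // simulate an out-of-memory situation
    }
    if (allocations_until_failure > 0) --allocations_until_failure;
    return ::operator new(bytes);
}
}  // namespace faultinject

// Code under test: should turn allocation failures into a clean error result
// instead of crashing or leaking resources.
std::string build_report() {
    try {
        void* buffer = faultinject::allocate(1 << 20);
        ::operator delete(buffer);
        return "report built";
    } catch (const std::bad_alloc&) {
        return "error: out of memory handled gracefully";
    }
}

int main() {
    faultinject::allocations_until_failure = 0;  // fail the very next allocation
    std::cout << build_report() << '\n';         // exercises the OOM path
}
```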

Extensive Testing with Growing Teams

As a result of a growing team and software system, the number of test executions increases super-linearly over time. The reasoning is simple: in total, more developers commit increasingly often, and each commit both triggers test executions and adds additional tests, so the product of these two growing factors rises roughly quadratically.

Indeed, the number of test executions within the SAP HANA development process has been growing quadratically over time. The resources spent on testing, however, are already at their limit, with around a thousand servers purely dedicated to running tests. This prompted the introduction of a risk-based testing approach in which a 'test budget' is allocated to each software component. Tests are prioritized according to their efficiency in detecting issues and the code coverage they produce; a full test run is still executed once a day. The approach halved the test runtime and saved approximately 104 years of sequential test runtime over five months [1]. Each additional failure that only surfaced in a later stage of the pipeline corresponded to roughly two years of saved testing time. With this factor in mind, tuning the efficiency threshold for testing becomes an economic decision that weighs the benefits of reduced test runs against the manual labor needed to asynchronously fix failures uncovered by the less frequent tests.
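
One possible shape of such budget-constrained test selection, sketched under the assumption of a simple greedy ranking by value per minute of runtime (the actual prioritization used for HANA is more involved, cf. [1]):

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical test-selection sketch: each test has a runtime cost and a
// value (e.g. derived from historical failure detection and code coverage).
struct TestCase {
    std::string name;
    double runtime_minutes;
    double value;
};

// Greedily pick the tests with the best value-per-minute ratio until the
// component's test budget is exhausted.
std::vector<TestCase> select_tests(std::vector<TestCase> tests, double budget_minutes) {
    std::sort(tests.begin(), tests.end(), [](const TestCase& a, const TestCase& b) {
        return a.value / a.runtime_minutes > b.value / b.runtime_minutes;
    });
    std::vector<TestCase> selected;
    double used = 0.0;
    for (const auto& t : tests) {
        if (used + t.runtime_minutes <= budget_minutes) {
            selected.push_back(t);
            used += t.runtime_minutes;
        }
    }
    return selected;
}

int main() {
    std::vector<TestCase> tests = {
        {"txn_manager_component", 30.0, 9.0},
        {"full_e2e_recovery",     120.0, 10.0},
        {"sql_parser_unit",       2.0,  3.0},
    };
    for (const auto& t : select_tests(tests, 60.0)) {
        std::cout << t.name << '\n';  // tests chosen within a 60-minute budget
    }
}
```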

References

If not denoted otherwise, illustrations are taken from Dr. Böhm's slide deck.

[0] Plattner, Hasso. "A common database approach for OLTP and OLAP using an in-memory column database." Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. ACM, 2009.

[1] Bach, Thomas, Ralf Pannemans, and Sascha Schwedes. "Effects of an Economic Approach for Test Case Selection and Reduction for a Large Industrial Project." 2018 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW). IEEE, 2018.