Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
Scale-In, then Scale-Out - New Database Scaling Options with FPGAs and Hardware Acceleration

Thomas Richter and Yana Krasteva, Swarm64, Berlin

Short Overview

Swarm64 is a developer of hardware accelerator solutions and software for PostgreSQL, one of the most widely used databases in the world. The company provides an easy way for businesses to scale PostgreSQL performance for analytics systems.

Swarm64's vision is to let people keep things simple and complexity low. The company therefore follows the motto "First scale-in before you scale-out" and provides high-performance PostgreSQL extensions for faster analytics and easier scaling. The idea is to first scale in with FPGA-based parallel processing before adding more machines. When Swarm64 is installed, it programs the FPGA with hundreds of processes that work in parallel to write, read, filter, compress, and decompress data within the database tables for higher analytic database performance.

The company was founded in 2013 in Norway and now has operations in Boston, Palo Alto, and Berlin. Today, 35 people work at Swarm64. Swarm64's investors include Intel Capital, Investinor, Alliance Venture, and Xilinx.

Thomas Richter is one of the company's three founders and today its CEO. Dr. Yana Krasteva earned her Ph.D. at the Technical University of Madrid, where she worked on runtime-reconfigurable and adaptive systems. She gained extensive experience in software engineering and industrial FPGA R&D in domains such as supercomputing, aerospace, video processing, and construction. She has worked at Swarm64 since 2014 and currently holds the position of VP of Product.

A recording of the presentation is available on Tele-Task.

Summary

written by Fabian Heseding, Jannis Rosenbaum, Patricia Sowa, and Wanda Baltzer


Introduction to Problems, Goals, Solution

Data is more valuable today than ever; its variety and volume are continually increasing, and so is the speed at which it needs to be processed. Moreover, the reliability of these data processing systems is a problem that demands a solution. Until now, mainly individual modular solutions have been used to meet these challenges. Swarm64's vision is to abstract away the complexity and layering of these systems, hence the motto "First scale-in before you scale-out." Swarm64 is convinced that existing hardware should be upgraded or equipped with further components before new servers and computing capacity are purchased.

There are, however, many existing options for hardware acceleration: upgrading the CPU, using a GPU, installing ASICs, or using FPGAs. These options are illustrated in Fig. 1 in terms of efficiency and flexibility. Each of them has advantages and disadvantages.

The CPU is very flexible for programmers and optimal for serial programs. It can be equipped with terabytes of memory and offers simple debugging options. It is not without reason that DBMSs have relied on the CPU since their early days. New CPU generations usually offer more performance to keep up with growing data volumes, but in recent years these gains have increasingly stalled. The high flexibility comes at a high price in raw computing capacity.

This is one reason why GPUs have increasingly been used. They can process not only serial programs but also data-parallel programs using single instruction, multiple threads (SIMT) with outstanding performance. Unfortunately, interfaces to other hardware components quickly become bottlenecks, so only comparatively limited amounts of RAM can be connected efficiently. The developer experience also suffers because programming and debugging are more complex. It is therefore not surprising that only a few DBMSs make truly optimal use of the GPU, even though column-based data would theoretically be well suited to SIMT.

Application-specific integrated circuits (ASICs) are not used despite their outstanding performance, as their multi-year, costly development cycles cannot keep up with the fast-moving data processing world. So Swarm64 opted for the remaining alternative: the FPGA.

An FPGA can be programmed to execute any logic but does so "closer to the metal" and therefore much faster than a CPU. Although development is not very comfortable, it is more flexible than an ASIC, and updates can be applied in a few seconds.

With this approach and a PostgreSQL module called Swarm64 Database Accelerator, which is designed for several FPGA systems, it was possible to create a solution that meets the initial requirements of modern data warehouses. It is significantly cheaper and more straightforward than "scale-out" solutions. Moreover, this module addresses several other problems too.

Many so-called data warehouses rely on old, no longer supported DBMSs and analysis systems. Other problems are licensing costs in the millions of dollars per year and difficulty scaling and integrating new data when outdated data warehouse platforms such as Oracle, Netezza, and DB2 are run.

Swarm64 intends to modernize these warehouses with its technology. Furthermore, since it relies on the free, open-source PostgreSQL as a DBMS rather than on proprietary software like many legacy solutions, customers can build on a large community. Many developers and admins are already experienced with PostgreSQL, which reduces operating costs further when the Swarm64 Database Accelerator is used.

A special case of this deprecated software is IBM's Netezza. In 2003, Netezza revolutionized analytics with its SQL data warehouse appliance: FPGA boards working in parallel to ingest and query large volumes of data warehouse data on top of PostgreSQL.

Swarm64 offers more than just a simple replacement product. It is faster than Netezza and much more flexible, because only single servers in a cluster have to be equipped with FPGAs, while Netezza scales only in quarter-rack increments. Furthermore, Swarm64 can easily be hosted at large cloud hosters with FPGA support; Netezza only works as a classic appliance that has to be hosted on one's own servers.

Another concept Swarm64 is challenging is the time-series database. With the technology and speed the Swarm64 Database Accelerator achieves, millions of data changes can be written to a standard PostgreSQL database. Even IoT sensor data can thus be captured and evaluated in real time, without the need for specialized time-series databases.

Background: DBMS

Let us take a look at the background of these problems. To refresh: a "database" in common parlance consists of two components. On the one hand, there is the data itself, also called the database; on the other hand, the database management system (DBMS), which manages the data and takes care of querying existing data and inserting new data, as well as corresponding optimizations for both processes. When we talk about a database, we often mean the DBMS. But what, at a high level of abstraction, does today's software landscape use databases for?

They are mainly used for three different purposes. First, as a system of record, a system in which data is stored. Since we are mostly talking about important business data, the data must be processed quickly, saved, and protected from loss. We also usually want to read individual data records, so we have to search through a lot of irrelevant data to find the right one. In both cases, low latency counts.

Moreover, databases are essential for systems of analysis, i.e. systems in which a lot of data is read and analyzed. Here we are talking about terabytes of entries that have to be scanned and aggregated. In this case, bandwidth reaches its limits.

Last but not least, there are systems of engagement, in other words systems where users are encouraged to collaborate and whose data is usually numerous and has a real-time component. Individually, the queries are latency-dependent, but in aggregate this is a question of bandwidth.

So if we want to cover all these overlapping use cases, we have to optimize both latency and bandwidth, a goal Swarm64 is committed to; how exactly they achieve it will be explained later on. Swarm64 also builds on a very well-known database: PostgreSQL, an open-source object-relational SQL database that has been in constant development since the end of the 1980s. It is widely used in the developer community and, according to db-engines.com, is the fourth most used database worldwide.

Swarm64 decided on PostgreSQL because this free, open-source database offers feature robustness, reliability, and high performance after thirty years of development. Besides this popularity, it was very interesting for Swarm64 because of its easy extensibility. Thanks to this, they could develop their Database Accelerator as an add-on to PostgreSQL without having to modify PostgreSQL's own code. Every user of the Swarm64 Database Accelerator can thus run a "normal" PostgreSQL instance and configure it with further modules or open-source modifications as they see fit. Swarm64 "just" added FPGA acceleration and thereby extended PostgreSQL's analytics functionality.

Background: FPGAs

We have already discussed at length what can be achieved with FPGAs and why they are better than other hardware accelerators. But what precisely is an FPGA? An FPGA consists of many individual cells arranged in an array. Each cell consists of a lookup table (LUT for short, a table that assigns an output to each combination of inputs) and a flip-flop.

Of course, a lookup table, also known as a truth table, can be used to hardcode any logical function. The possible complexity of this function depends on the number of inputs; nowadays there are mostly six of them. These cells are then linked together in a configurable way. That way, very complex functions like the database optimizations of Swarm64 can be programmed into the SRAM of the lookup tables. Hence the name Field Programmable Gate Array: they can be reprogrammed in the field and consist of an array of connected logic gates.
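
How a LUT "computes" can be sketched in a few lines of Python: the table stores one output bit per input combination, so evaluation is a single lookup regardless of the function's complexity. The majority function used here is purely illustrative, not anything Swarm64 programs.

```python
from itertools import product

# A 6-input LUT is just a truth table: one stored output bit per
# input combination (2^6 = 64 entries).
def program_lut(logic_fn, n_inputs=6):
    """Build the LUT contents for an arbitrary boolean function."""
    return {bits: logic_fn(bits) for bits in product((0, 1), repeat=n_inputs)}

def majority(bits):
    # 1 if more than half of the inputs are 1 -- an arbitrary example
    return int(sum(bits) > len(bits) // 2)

lut = program_lut(majority)

# Evaluating the "cell" is a single table lookup, independent of
# how complex the programmed function is.
print(lut[(1, 1, 1, 1, 0, 0)])  # four of six inputs set -> 1
print(lut[(1, 1, 1, 0, 0, 0)])  # exactly half -> 0
```

Reprogramming the FPGA amounts to rewriting these tables (and the routing between cells), which is why updates take seconds rather than a new silicon run.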

FPGAs for data centers are usually addressed via PCIe, a standard interface with a standard form factor, so these cards can be installed in almost any server. Of course, this was not always the case but is the result of one of the many trends in the field of FPGAs.

Like many other technologies, FPGAs were initially very expensive, but this changed as manufacturing processes became cheaper and more manufacturers entered the market. The first to use the then-expensive FPGAs to process large amounts of data was Netezza, which was later acquired by IBM. However, IBM announced in 2018 that it would discontinue this activity. Today, powerful FPGA cards for enterprise use can be purchased for a few thousand euros and fit into any server rack, so even smaller companies and startups can optimize their data processing.

In addition, there are significantly improved developer tools that abstract programming and make it easier for conventional developers to write software for FPGAs. There are successful efforts to use High-Level Synthesis to automatically convert algorithms written in C into a register-transfer language, so the developer does not have to be experienced in the latter. A shift from the "data movement" concept of classical programming to the "data flow" concept of FPGAs is of course still necessary, but also easier to learn. The developer no longer has to worry about many pre- and post-processing steps, as libraries can take over that work.
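
The data-flow mindset can be pictured with a software analogy: records stream through a chain of stages, each consuming and producing a stream, instead of being loaded wholesale and iterated over. Plain Python generators stand in for hardware pipeline stages here; this is our illustration, not Swarm64's toolchain.

```python
# Illustrative analogy: pipeline stages as generators, each one
# streaming records onward instead of materializing everything.
def source(rows):
    yield from rows

def decompress(stream):
    # placeholder stage -- on an FPGA this would be a decompressor
    for row in stream:
        yield row

def filter_stage(stream, pred):
    # filtering happens as data flows past, not after a bulk load
    for row in stream:
        if pred(row):
            yield row

rows = [{"id": i, "price": i * 10} for i in range(5)]
pipeline = filter_stage(decompress(source(rows)), lambda r: r["price"] >= 20)
print([r["id"] for r in pipeline])  # -> [2, 3, 4]
```

On real hardware all stages run concurrently, one record entering a stage while the previous record is already in the next; the generator chain only mimics the wiring, not the parallelism.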

When a final program is ready to be used on an FPGA, it is easier than ever to roll it out and actually execute it, because many established and new cloud hosters provide servers with FPGAs that can be rented by the minute if required. FPGAs-as-a-Service is not the end of the road, however: FPGAs will also be integrated into other products in the future, such as the Samsung SmartSSD, a combination of an SSD and a standard FPGA, which will also be supported by Swarm64. Smart network interfaces are now equipped with FPGAs as well, and this trend will undoubtedly continue.

Approach

Swarm64 uses different approaches to turn its vision into reality. For one, they concentrate on accelerated tables, which can handle very large tables and are efficient for range queries and for data arriving at high velocity. However, an alternative should be used if the interactions with the table consist of many small transactions, or if the goal is to fetch only single values from an indexed column.

Another approach is the use of optimized columns to improve the data layout. They are best for handling dates, prices, quantities, sensor data, and spatial data and are commonly used for columns that are queried by range. The more optimized columns are used, the faster the query becomes, while their order stays irrelevant. However, usage is limited to one to three columns per table.
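
Why a range-optimized layout speeds up range queries can be pictured with block skipping: keep min/max metadata per storage block so whole blocks that cannot match are never touched. This is an analogy of ours, similar in spirit to PostgreSQL's BRIN indexes, not Swarm64's proprietary format.

```python
# Illustrative analogy (in the spirit of PostgreSQL BRIN indexes,
# not Swarm64's actual on-disk format): per-block min/max metadata
# lets a range query skip non-matching blocks wholesale.
BLOCK = 4  # rows per block, kept tiny for the example

def build_blocks(values):
    blocks = []
    for i in range(0, len(values), BLOCK):
        chunk = values[i:i + BLOCK]
        blocks.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return blocks

def range_scan(blocks, lo, hi):
    """Scan only blocks whose [min, max] overlaps the query range."""
    hits, scanned = [], 0
    for b in blocks:
        if b["max"] < lo or b["min"] > hi:
            continue  # whole block skipped without reading its rows
        scanned += 1
        hits.extend(v for v in b["rows"] if lo <= v <= hi)
    return hits, scanned

blocks = build_blocks(list(range(100)))   # 25 blocks of 4 sorted values
hits, scanned = range_scan(blocks, 10, 17)
print(hits, scanned)  # finds 10..17 while scanning only 3 of 25 blocks
```

The effect grows with data size: on well-clustered columns, most blocks fall entirely outside the queried range and cost nothing.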

To improve the insertion of records, Swarm64 queues the data on the CPU after insertion and compresses it using the hardware accelerators. Instead of updating the table and indices in a row store, or preparing the data and updating the columns in a column store, Swarm64 is thereby able to achieve 20 million records per second instead of only one million. This approach is illustrated in Fig. 2.

Fig. 2: Fast, compressed, and hardware-accelerated storing of new records
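
The ingestion idea above can be sketched as queue-then-compress: buffer incoming rows and compress full batches in one shot before they hit storage. This is a minimal sketch under our own assumptions; zlib stands in for the FPGA compressor, and the batch size and JSON block format are made up for illustration.

```python
import json
import zlib

BATCH_SIZE = 1000  # assumed batch size, purely illustrative

class IngestQueue:
    """Buffer rows on the CPU side; compress and store full batches."""
    def __init__(self):
        self.pending = []
        self.stored_blocks = []

    def insert(self, row):
        self.pending.append(row)
        if len(self.pending) >= BATCH_SIZE:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        raw = json.dumps(self.pending).encode()
        # zlib stands in for the FPGA compressor: one compressed block
        self.stored_blocks.append(zlib.compress(raw))
        self.pending = []

q = IngestQueue()
for i in range(2500):
    q.insert({"sensor": i % 8, "value": i * 0.5})
q.flush()  # flush the final partial batch
print(len(q.stored_blocks))  # 3 blocks: 1000 + 1000 + 500 rows
```

The point of batching is amortization: per-row index maintenance is replaced by one compression pass per block, which is what makes the reported 20x insert throughput plausible.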

Swarm64's solution for queries is to extend the PostgreSQL software so that it stays more parallel, as shown in Fig. 3. To achieve this, blocks are stored compressed in a hybrid row/column store, which is the structure behind Swarm64's proprietary optimized columns. As a result, less I/O transfer and less storage space for indices are needed. These blocks are sent to the FPGA, where query processing is initiated. First, the blocks are decompressed as part of the FROM section of the query. Afterward, columns are picked as part of SELECT, and rows are picked as part of the WHERE section. Finally, the result is assembled and sent back to the CPU to be finalized.
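
The per-block stages just described (decompress, project columns, filter rows) can be sketched as one function. Again, zlib and the dict-based rows are our stand-ins for the on-chip format, not Swarm64's implementation.

```python
import json
import zlib

def make_block(rows):
    # stand-in for a compressed hybrid row/column block
    return zlib.compress(json.dumps(rows).encode())

def process_block(block, columns, predicate):
    """One FPGA job: decompress (FROM), project (SELECT), filter (WHERE)."""
    rows = json.loads(zlib.decompress(block))                # FROM
    projected = [{c: r[c] for c in columns} for r in rows]   # SELECT
    return [r for r in projected if predicate(r)]            # WHERE

block = make_block([
    {"id": 1, "price": 5,  "note": "a"},
    {"id": 2, "price": 42, "note": "b"},
    {"id": 3, "price": 99, "note": "c"},
])
result = process_block(block, ["id", "price"], lambda r: r["price"] > 10)
print(result)  # -> [{'id': 2, 'price': 42}, {'id': 3, 'price': 99}]
```

On the FPGA, hundreds of such block jobs run concurrently, and only the small assembled result crosses back over PCIe to the CPU.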

The FPGA is able to process hundreds of these jobs in parallel. Moreover, simple filtering conditions and wildcard filtering are pushed down to the FPGA and benefit from the acceleration. For the best overall performance, the FPGA supports the CPU in different tasks: it loads data into the database, keeps data at rest compressed in the cache, helps with selecting and filtering, and handles complex processing such as time-series queries or full-text search with wildcards.

Main results

With the FPGA parallel processing strategy, Swarm64 was able to achieve up to 50x faster queries, 35x faster loading, and 5x less storage space. Moreover, the total cost of ownership for a completely maintained PostgreSQL server is 4x lower with Swarm64. Their PostgreSQL extension, in combination with the hardware acceleration, has proven to be a significant improvement that enables faster analytics and easier scaling.

Swarm64 achieved impressive results in data warehouse modernization, Netezza replacement, and time-series data. In particular, replacing costly, outdated databases with PostgreSQL and Swarm64 hardware acceleration, supporting discontinued Netezza workloads, high-velocity data insertion, and fast, concurrent queries are Swarm64's sweet spot.

These three key ideas have been consolidated by Swarm64 into one system, demonstrating a highly sought-after use case for trading companies. In numbers, Swarm64 had to support a minimum of 100,000 transactions per second with 5,000 financial symbols to analyze at a ten-second analysis level.

They did not use a traditional architecture that scales out and adds a network between components, which would have raised latency and bandwidth concerns. Instead, with the help of Samsung's SmartSSD, Intel Optane DC Persistent Memory, and FPGAs, Swarm64 scaled in, increasing throughput from 100,000 transactions per second to 12,000,000 per minute and improving analysis from a ten-second level to well below a second while keeping CPU load rather low. A trading company could thus significantly increase market coverage and critically decrease decision time, which was demonstrated by instantly publishing simple trading decisions to a live feed.

Hardware accelerators, as used by Swarm64, have been trending lately, and they are definitely here to stay. In particular, Samsung's SmartSSD with its built-in hardware accelerator is expected to cost only slightly more than a comparable SSD.

Hardware accelerators can be quite powerful, and Swarm64 has developed methods to play to their strengths, demonstrating mechanisms to scale in before scaling out or up. This is possible with two key components: the Swarm64 hardware acceleration and the Swarm64 PostgreSQL support. Not only does the hardware acceleration decrease complexity by scaling in at lower costs than other scaling strategies, but the Swarm64 PostgreSQL solution also keeps the cost of ownership steady or lower than other systems. Swarm64 also contributes back to the PostgreSQL community.
