Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

Moving towards interactive Data Analysis

Carsten Binnig, Data Management Lab, TU Darmstadt

Abstract

Technology has been the key enabler of the current Big Data movement. Without open-source tools like R and Spark, as well as the advent of cheap, abundant computing and storage in the cloud, the trend toward datafication of almost every field in research and industry could never have happened. However, the current Big Data tool set is ill-suited for interactive data analytics that better involves the human in the loop, which makes knowledge discovery a major bottleneck in our data-driven society. In this talk, I will present an overview of our current research efforts to revisit the current Big Data stack from the user interface to the underlying hardware to enable interactive data analytics and machine learning on large data sets.

Biography

Carsten Binnig is a Full Professor in the Computer Science department at TU Darmstadt and an Adjunct Associate Professor in the Computer Science department at Brown University. Carsten received his PhD at the University of Heidelberg in 2008. Afterwards, he spent time as a postdoctoral researcher in the Systems Group at ETH Zurich and at SAP, working on in-memory databases. Currently, his research focuses on the design of data management systems for modern hardware as well as modern workloads such as interactive data exploration and machine learning. His work has been recognized with a Google Faculty Award as well as multiple best paper and best demo awards.

A recording of the presentation is available on Tele-Task.

Summary

written by Jonas Kordt, Jonas Buecker, Maximilian Schall, and Mino Boeckermann

Part 1: Interactive User Interfaces

As described in the abstract above, technology has been the key enabler of the current Big Data movement, but the current Big Data tool set is ill-suited for interactive data analytics that keeps the human in the loop, making knowledge discovery a major bottleneck. In his talk, Carsten Binnig presented an overview of his group's research efforts to revisit the current Big Data stack, from the user interface down to the underlying scalable data management systems, in order to enable interactive data analytics and machine learning on large data sets.

Fig. 1 Darmstadt Data Analysis Stack

We start at the top with the user interface layer, Vizdom, together with the interactive execution engine IDEA, since it is what makes the UI interactive. Vizdom supports pen and touch interactions to give the user a simple and natural way to create visualizations. Research [1] has shown that response times above 500 ms already limit the exploration space and productivity of the user, so staying below this latency was the essential goal for an interactive UI. To reach this speed, IDEA operates as a middleware between the data source and the Vizdom application and uses an approximate query engine to speed up the whole process. It splits SQL queries into smaller queries, approximates the result, and maintains a result cache based on Bayes' theorem that models the intermediate results of all visualizations as random variables.
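
The following Python fragment is a minimal sketch of this idea of chunked, progressive execution with a cache of intermediate results. It deliberately does not model the Bayesian result cache described above; the names (ResultCache, run_progressive), the chunking strategy, and the example data are illustrative assumptions, not the actual IDEA implementation.

import random

# Minimal sketch of progressive, approximate query execution: a large table
# is processed in small chunks, a running estimate is returned after every
# chunk, and the per-chunk aggregates are cached under a query signature so
# that later refinements could reuse them.

class ResultCache:
    def __init__(self):
        self._partials = {}          # query signature -> list of chunk aggregates

    def partials(self, signature):
        return self._partials.setdefault(signature, [])

def run_progressive(rows, predicate, chunk_size, cache, signature):
    """Yield an approximate average after every processed chunk."""
    partials = cache.partials(signature)
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        matching = [r["value"] for r in chunk if predicate(r)]
        partials.append((sum(matching), len(matching)))
        total = sum(s for s, _ in partials)
        count = sum(c for _, c in partials)
        yield total / count if count else float("nan")   # current estimate

# Usage: estimate the average value for rows with age > 65, chunk by chunk.
random.seed(0)
table = [{"age": random.randint(20, 90), "value": random.gauss(50, 10)}
         for _ in range(100_000)]
cache = ResultCache()
for estimate in run_progressive(table, lambda r: r["age"] > 65, 10_000,
                                cache, "avg_value_age_gt_65"):
    print(f"approximate average so far: {estimate:.2f}")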

The second part of the UI layer revolves around DBPal, a natural language (NL) interface for the DBMS that enables a natural and concise way to query data. A user without profound knowledge of database query languages should be able to pose a question like "Show me the patients with fever over 65 years" and get the desired results. The big challenge is how to translate natural language into SQL queries. The solution Binnig and his team came up with was deep learning, which raised the question of how to obtain the necessary training data. Crowdsourcing, which is often used to gather such training data, is not an option here, since writing correct SQL for natural language questions is not something a large crowd of untrained people can do reliably. They therefore decided to generate the training data themselves and used templates to generate NL/SQL query pairs. Augmentation techniques, namely paraphrasing and noising, were applied to multiply the number of NL/SQL pairs.

To evaluate DBPal, benchmark tests were conducted on a simple database schema for patient data and a complex schema for geographic data. In this benchmark, DBPal was compared to NaLIR, a traditional rule-based natural language interface, and to NSP and NSP++, which are deep learning models trained on manually created data. The benchmark compared performance across a range of linguistic categories, and DBPal performed very well overall (see figure 2). The basis for all of this is a highly functional, scalable database management system (DBMS). The development of this lower half of the Darmstadt DA stack is the main topic of the following parts.
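
Returning to the training-data generation step, the following Python sketch illustrates template instantiation combined with paraphrasing and noising. The templates, slot values, and paraphrase lists are made-up placeholders for illustration, not the actual DBPal pipeline.

import random

# Sketch of template-based NL/SQL pair generation plus simple augmentation.

TEMPLATES = [
    ("show me the {table} with {attr} {op_nl} {val}",
     "SELECT * FROM {table} WHERE {attr} {op_sql} {val}"),
]
SLOTS = {
    "table": ["patients"],
    "attr": ["age", "fever"],
    "op_nl": ["over", "under"],
    "val": ["65", "39"],
}
OP_SQL = {"over": ">", "under": "<"}
PARAPHRASES = {"show me": ["list", "give me"], "with": ["that have", "whose"]}

def instantiate():
    """Fill one NL/SQL template pair with randomly chosen slot values."""
    nl_t, sql_t = random.choice(TEMPLATES)
    binding = {k: random.choice(v) for k, v in SLOTS.items()}
    binding["op_sql"] = OP_SQL[binding["op_nl"]]
    return nl_t.format(**binding), sql_t.format(**binding)

def paraphrase(nl):
    """Randomly swap phrases for alternatives to diversify the NL side."""
    for phrase, alternatives in PARAPHRASES.items():
        if phrase in nl and random.random() < 0.5:
            nl = nl.replace(phrase, random.choice(alternatives))
    return nl

def noise(nl, drop_prob=0.1):
    """Randomly drop words to make the model robust to incomplete input."""
    words = nl.split()
    return " ".join(w for w in words if random.random() > drop_prob) or nl

random.seed(1)
for _ in range(5):
    nl, sql = instantiate()
    print(noise(paraphrase(nl)), "->", sql)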

Fig. 2 Benchmark Results
Fig. 3 Network-Attached Memory (NAM) Architecture

Part 2: Scalable Data Management

Using the interactive data analysis tools already transforms how you can analyze data. But what happens when your data grows and you cannot keep it all in one place anymore? The database system you are building on needs to grow with it, so it must scale well. Traditionally, when scaling out a database system to many machines, the network quickly became the bottleneck.

When Carsten Binnig started his research in this field, he and his team looked at high-speed networks, which used to be very expensive. They realized, however, that prices were dropping while performance kept increasing. They found that with four network cards per server they could match the bandwidth of DDR3 main memory. Thus, the network suddenly was no longer the bottleneck.

But testing their distributed database system on the high-speed network revealed that it still did not scale. The problem was the huge overhead of handling TCP/IP messages: more machines meant more messages, so the CPUs were busy handling all those messages and could not process database transactions. Another problem was posed by central data structures such as counters, because those can only be modified by one machine at a time, so other machines have to wait, which creates bottlenecks.

RDMA instead of TCP/IP

A great way to solve the TCP/IP bottleneck is Remote Direct Memory Access (RDMA). RDMA bypasses the operating system: data on another machine can be read (or written) without going through the other machine's operating system stack, including its TCP/IP stack.

RDMA offers two types of communication. One-sided communication reads from and writes to the memory of another machine directly, without using the other machine's CPU. Circumventing the remote CPU improves scalability because that CPU cannot become a bottleneck. Two-sided communication sends messages to and receives messages from another machine. Here both CPUs are involved, but the advantage of bypassing the operating system remains.
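
To make the difference between the two styles concrete, here is a purely conceptual Python sketch. Real RDMA is programmed against the verbs interface in C and involves registered memory regions and queue pairs; in this toy model the "remote machine" is just a local object, so the sketch only shows which side's CPU participates, not actual networking.

# Conceptual sketch contrasting one-sided and two-sided RDMA communication.

class RemoteMachine:
    def __init__(self, size):
        self.memory = bytearray(size)   # stand-in for a registered memory region
        self.inbox = []                 # stand-in for a receive queue

    # Two-sided: the remote CPU has to pick up the message and act on it.
    def poll_and_handle(self):
        while self.inbox:
            offset, payload = self.inbox.pop(0)
            self.memory[offset:offset + len(payload)] = payload

# One-sided READ/WRITE: the local side accesses remote memory directly;
# the remote CPU never runs any handler.
def rdma_write(remote, offset, payload):
    remote.memory[offset:offset + len(payload)] = payload

def rdma_read(remote, offset, length):
    return bytes(remote.memory[offset:offset + length])

# Two-sided SEND: the message lands in the remote receive queue and must be
# processed by the remote CPU.
def rdma_send(remote, offset, payload):
    remote.inbox.append((offset, payload))

server = RemoteMachine(64)
rdma_write(server, 0, b"one-sided")     # no remote CPU involvement
rdma_send(server, 16, b"two-sided")     # remote CPU must handle it
server.poll_and_handle()
print(rdma_read(server, 0, 32))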

Network-Attached Memory (NAM) architecture

With RDMA and the separation of state and computation, the database stack can be redesigned. The first step is to replace all communication that was previously done over TCP/IP with RDMA. The second step is to introduce dedicated compute servers that execute the queries and dedicated memory servers that store the state.

Fig. 4 Workload optimization

Now any compute server can access any state via RDMA, no matter on which memory server it is stored. This means that central bottlenecks can be avoided. If, for example, one compute server is overloaded with some computation but is supposed to execute a query, any other compute server can simply take over that query, because it too can access all the state necessary for the query. The same goes for memory servers: if one becomes a bottleneck, some of its data can simply be moved to a different memory server, because it does not matter on which memory server the data is stored.

This architecture brings another advantage: compute power and memory bandwidth can be scaled independently of each other by adding more servers of a particular kind. This can even be used to optimize for different workloads. Suppose you have six servers available. For a compute-intensive workload, you can use four of them as compute servers and the other two as memory servers. For a different, more memory-intensive workload, you could instead divide the six servers equally into three compute servers and three memory servers.
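
A toy sketch of this elasticity is shown below: the same pool of machines is divided into compute and memory roles depending on the workload. The function name and the ratios simply restate the six-server example from the text and are not tuning rules from the actual system.

# Assign roles to a fixed pool of machines based on the workload mix.
def assign_roles(machines, compute_fraction):
    n_compute = round(len(machines) * compute_fraction)
    return machines[:n_compute], machines[n_compute:]

machines = [f"node{i}" for i in range(6)]

compute, memory = assign_roles(machines, 4 / 6)   # compute-intensive workload
print("compute-heavy split:", compute, memory)

compute, memory = assign_roles(machines, 3 / 6)   # memory-intensive workload
print("balanced split:     ", compute, memory)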

Fig. 5 Scalability of NAM

Optimization using locality

To optimize this architecture even further, the different servers in figure 4 can be viewed as logical rather than physical separations. One can then exploit locality and collocate a compute server and a memory server on the same machine. If the compute server needs to access state on that memory server, it can skip RDMA entirely and use local memory access instead, which increases the performance of the system even further.
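
The following minimal Python sketch shows the access decision behind this optimization: check whether the page's memory server is collocated and, if so, take the local fast path. The page table, machine names, and access functions are hypothetical placeholders, not the real system's data structures.

LOCAL_MACHINE = "node3"

# page id -> (machine hosting the memory server, offset in its region)
PAGE_TABLE = {
    "page_a": ("node3", 0),      # collocated with this compute server
    "page_b": ("node5", 4096),   # remote
}

def local_read(offset):
    return f"local memory read at offset {offset}"

def rdma_read(machine, offset):
    return f"one-sided RDMA read from {machine} at offset {offset}"

def read_page(page_id):
    machine, offset = PAGE_TABLE[page_id]
    if machine == LOCAL_MACHINE:
        return local_read(offset)        # fast path: skip RDMA entirely
    return rdma_read(machine, offset)    # slow path: go over the network

print(read_page("page_a"))
print(read_page("page_b"))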

Binnig's team measured the scalability by running the different variants of the system on up to 56 machines and recording transactions per second. The results can be seen in figure 5. The blue line is the old database system running on the newer high-speed network. The red line shows the new NAM system using RDMA but without collocating memory and compute servers on the same physical machine; this already scales linearly. The performance can be improved further by exploiting locality, as the yellow line shows.

Fig. 6 Design Matrix

Part 3: Indexing

One last question arises for remote memory servers: in an architecture like the one in figure 3, how can tables stored on the memory servers be accessed efficiently from the compute servers? Without indexes, the compute servers would need to scan the whole memory for a single data point or a range query. Traditional relational database management systems (RDBMS) use indexes to achieve efficient data access by minimizing the number of disk accesses required when a query is processed.

The design space for such index structures has two dimensions: how the index itself is distributed and how the index structure is accessed. Two possible designs for index distribution are coarse-grained distribution and fine-grained distribution. To access the indexes, the two RDMA methods presented earlier are used: one-sided RDMA and two-sided RDMA. Figure 6 shows both dimensions and which combinations are useful. These are described in detail below, together with a hybrid of both designs.

 

Fig. 7 Coarse-Grained/Two-Sided RDMA

Design 1: Coarse-Grained/Two-Sided RDMA

The index is range-partitioned across the memory servers, and each memory server stores its index partition next to the corresponding data. If a compute server wants to access a data item, the request is sent to the responsible memory server via a two-sided message. The memory server traverses the index using its own CPU and returns the result.

Only one round trip is necessary, but this design is sensitive to skew in the access distribution: if many compute servers need data from the same memory server, that server becomes a bottleneck.
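
The access pattern of design 1 can be sketched as follows. All data structures here are simplified stand-ins (a dictionary instead of a real index, fixed partition bounds); only the routing and the single remote round trip are the point.

import bisect

PARTITION_BOUNDS = [1000, 2000, 3000]     # upper key bounds of servers 0..2, rest on server 3

class MemoryServer:
    def __init__(self):
        self.index = {}                   # local index partition: key -> row

    def handle_lookup(self, key):         # runs on the memory server's CPU
        return self.index.get(key)

servers = [MemoryServer() for _ in range(4)]

def owner(key):
    """Determine which memory server owns the key range containing key."""
    return bisect.bisect_left(PARTITION_BOUNDS, key)

def point_query(key):
    # one two-sided round trip: send request, remote CPU traverses its index
    return servers[owner(key)].handle_lookup(key)

servers[owner(42)].index[42] = {"id": 42, "age": 71}
print(point_query(42))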

Fig. 8 Fine-Grained Distribution / One-Sided RDMA

Design 2: Fine-Grained Distribution / One-Sided RDMA

Every index node of a B-tree, a self-balancing tree structure, is distributed in a round-robin manner across the N memory servers. To access data, a compute server first reads the index page of the root node from one memory server, determines which pointer must be traversed to reach the next child, fetches that node, and decides whether to go left or right. This repeats down to the leaves until the data is found.

This algorithm requires multiple round trips but gives better load balancing than design 1, since multiple servers are touched and the memory is utilized more evenly. For range queries, multiple leaf pages often need to be returned.
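
The traversal of design 2 can be sketched as follows. The tiny hard-coded tree, the node layout, and the print statement standing in for a one-sided RDMA read are all illustrative assumptions; the point is that the compute server walks the tree itself, paying one round trip per level.

NUM_SERVERS = 3

# node id -> node content; node ids are placed round-robin on the servers
NODES = {
    0: {"keys": [100], "children": [1, 2]},                          # root
    1: {"keys": [50],  "rows": {10: "row(10)", 50: "row(50)"}},      # leaf
    2: {"keys": [150], "rows": {100: "row(100)", 150: "row(150)"}},  # leaf
}

def fetch_node(node_id):
    server = node_id % NUM_SERVERS        # round-robin placement
    # in the real system this is a one-sided RDMA read from that server
    print(f"one-sided read of node {node_id} from memory server {server}")
    return NODES[node_id]

def point_query(key):
    node = fetch_node(0)                  # start at the root
    while "children" in node:             # inner node: pick the child to follow
        child = node["children"][0 if key < node["keys"][0] else 1]
        node = fetch_node(child)          # one more round trip per level
    return node["rows"].get(key)

print(point_query(150))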

Fig. 9 Hybrid of Coarse-Grained and Fine-Grained Distribution

Design 3: Hybrid of Coarse-Grained and Fine-Grained Distribution

In this approach, the two other designs are combined. The upper levels of the index are distributed in a coarse-grained fashion, while the lower levels follow a fine-grained distribution. The compute server sends a two-sided request with the required key to the memory server. The memory server traverses the upper levels of the tree and returns a pointer indicating which leaf page is required. The compute server then fetches the leaf page(s) directly through one-sided accesses.

The advantage of this design is that in the last step most of the data has to be read anyway. By using a two-sided request for the index traversal, only one additional round trip is required to find the location of the necessary pages.
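
Design 3 can be sketched by combining the two previous fragments: a two-sided lookup resolves the leaf pointer, and one-sided reads fetch the leaf pages. As before, the layout and the print statement standing in for an RDMA read are simplified stand-ins, not the actual implementation.

LEAF_PAGES = {                             # leaf id -> rows, spread over the memory servers
    "leaf0": {10: "row(10)", 50: "row(50)"},
    "leaf1": {100: "row(100)", 150: "row(150)"},
}

def index_lookup(key):                     # runs on the memory server's CPU
    # traverse the coarse-grained upper levels and return the leaf pointer
    return "leaf0" if key < 100 else "leaf1"

def fetch_leaf(leaf_id):
    # in the real system this is a one-sided RDMA read of the leaf page
    print(f"one-sided read of {leaf_id}")
    return LEAF_PAGES[leaf_id]

def point_query(key):
    leaf_id = index_lookup(key)            # one two-sided round trip
    return fetch_leaf(leaf_id).get(key)    # then fetch the page directly

print(point_query(150))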

Fig. 10 Benchmarks

Part 4: Evaluation

To see the practical effect of the different designs, they were tested in an environment simulating real-world conditions. On the hardware side, the test setup consisted of four memory servers and six compute servers. Up to 240 clients were employed to evaluate the scalability of the designs. The requests consisted of two types of queries: point queries, which request a single row of data, and range queries, which request multiple rows using a condition such as age greater than 30. The data provided by the servers consisted of 100 million unique keys. The requested data was skewed to test the robustness of the designs, which is important because real systems can be expected to have an uneven access distribution. The results of the experiment can be seen in figure 10.

For point queries with few clients, the overhead of the fine-grained design lowers its throughput significantly compared to the coarse-grained and hybrid approaches. With a higher number of clients, however, the impact of the skewed data access shows. The fine-grained design, with its more evenly distributed data, is almost unaffected. The throughput of the coarse-grained design, on the other hand, begins to decrease beyond 80 clients. And while the hybrid design no longer increases its throughput, it remains mostly stable and achieves the highest throughput for all tested numbers of clients.

For range queries, the designs behave differently. The coarse-grained system stagnates immediately. Both the fine-grained and hybrid systems appear to have linearly growing throughput with an increasing number of clients. The fine-grained system appears to be faster; however, the difference is only pronounced on queries with low selectivity. Otherwise, the hybrid approach is close to identical to the fine-grained approach.

The hybrid approach manages to take advantage of the strengths of the more extreme approaches while mitigating their weaknesses. While it is not the fastest in every regard, it is the most robust across the board and performed well independently of the workload.

Part 5: Conclusion

As shown above, database systems can be scaled significantly by connecting multiple servers via a network. The former network bottleneck can be overcome by using modern high-speed interconnects and communicating via RDMA to avoid costly overhead. This also enables a logical separation of computation and memory, so both can be scaled independently. Important memory structures and frequently accessed data can be distributed over multiple machines, and an intelligent indexing structure keeps the risk of performance loss from contention or load imbalance to a minimum. With all of that in place, an easily scalable database backend for real-time-oriented data analysis systems becomes possible.

Together with the other components discussed, such as an approximate-query-enabled middleware and intuitive user interfaces, this facilitates a simple-to-use real-time data analysis system.

References

[1] Zhicheng Liu, Jeffrey Heer: The Effects of Interactive Latency on Exploratory Visual Analysis. IEEE, pp. 2122–2131, 2014.