Prof. Dr. Tilmann Rabl

Test Data Generator Tooling

Bachelor Project, Winter 2019/2020

Parallel Data Generator Framework (PDGF)

With the current rise of machine learning libraries and frameworks, as well as general data processing and data analytics engines, there is a growing need for good-quality test data. For that reason, bankmark UG (haftungsbeschränkt) has developed the Parallel Data Generator Framework (PDGF). The main goal of PDGF is to provide the means to create massively parallel and scalable data generation programs without the need for any low-level programming. To that end, PDGF provides an XML-based specification language and many generators, e.g., a long number generator or a date-time generator, that can be used to specify a custom schema along with the logic and rules for generating test data for that particular schema.

The current way of developing such data generator programs involves writing schema specifications in XML. For the most part, using XML works just fine: there is no need for an IDE or compiler tool-chain, the programs are verbose enough to be understood by other people, and XML as a language is standardized and well-defined, which means there is a plethora of tools available to aid in the creation of XML documents. But the use of XML also has its shortcomings. Data schemas, for instance, are inherently typed; that is, fields have types, and records in turn consist of typed fields. XML does not make it easy to enforce these types. As a result, typing-related bugs that could be found in advance instead lead to run-time exceptions. Another case where XML can be cumbersome is when more complicated logic is needed. There are mechanisms to inject Java code into the XML specifications, but doing so without an IDE or the typical tooling is less efficient than simply writing a program in Java with an IDE.
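To illustrate the typing point, the following is a minimal sketch of how a typed schema could look in plain Java. It is not the PDGF API: the names TypedSchemaSketch, Field, customerSchema, and generateRow are invented for this example. The sketch only shows that a generator producing the wrong type is rejected at compile time rather than failing at run time.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.Function;

// Hypothetical sketch: none of these names belong to PDGF.
final class TypedSchemaSketch {

    /** A field whose generator must produce values of the declared type T. */
    static final class Field<T> {
        final String name;
        final Function<Random, T> generator;

        Field(String name, Function<Random, T> generator) {
            this.name = name;
            this.generator = generator;
        }
    }

    /** Builds a small example schema with one Long field and one Boolean field. */
    static List<Field<?>> customerSchema() {
        Field<Long> id = new Field<>("id", rng -> (long) rng.nextInt(1_000_000));
        Field<Boolean> active = new Field<>("active", Random::nextBoolean);
        // The following line is rejected by the compiler, because the
        // generator yields a String where the field is declared as Long:
        // Field<Long> age = new Field<>("age", rng -> "forty-two");

        List<Field<?>> schema = new ArrayList<>();
        schema.add(id);
        schema.add(active);
        return schema;
    }

    /** Generates one row by asking every field's generator for a value. */
    static Map<String, Object> generateRow(List<Field<?>> schema, long seed) {
        Random rng = new Random(seed);
        Map<String, Object> row = new LinkedHashMap<>();
        for (Field<?> field : schema) {
            row.put(field.name, field.generator.apply(rng));
        }
        return row;
    }

    public static void main(String[] args) {
        System.out.println(generateRow(customerSchema(), 42L));
    }
}
```

In an XML specification, the equivalent mistake would typically surface only once the generator is executed.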

A solution is to provide a Java-API in addition to the existing XML-API. But this means that all the parsing logic needs to be duplicated: once for XML and once for Java. A good fit for resolving the duplication issue could be the Java Architecture for XML Binding (JAXB). With JAXB it is possible to simply annotate existing code to specify and identify the relevant interfaces, classes, and attributes; JAXB then handles all the marshalling and unmarshalling, including parsing. As a result, the PDGF code should be almost free of XML-related code.

Project Outline

The main goal is to use and showcase the usefulness of a Java-API for PDGF. The participants should develop several applications that leverage the new API. Some examples of these applications are:

• a graphical user interface (GUI) for defining data generator programs,

• a web-service for generating data,

• a framework for unit testing that will generate appropriate data based on the type signature.

Such an application should then be showcased with a runnable prototype. Even though all applications are independent of each other, they should use a common API, which should be developed as part of the project. Ultimately, the goal is to identify problems with the PDGF architecture that make the use of JAXB complicated or even prevent its use. Ideally, the XML-API itself should be left untouched from the perspective of the end-user for backwards compatibility.
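As an illustration of the third idea in the list above, the following sketch generates arguments for a method from its parameter types using Java reflection. It is only one possible reading of "based on the type signature": the names SignatureDataSketch, valueFor, argumentsFor, methodNamed, and discountedPrice are hypothetical, and a plain seeded java.util.Random stands in for whatever generators a PDGF-backed framework would use.

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch: a stand-in for a PDGF-backed test data framework.
final class SignatureDataSketch {

    /** Picks a value for one parameter based solely on its declared type. */
    static Object valueFor(Class<?> type, Random rng) {
        if (type == int.class || type == Integer.class) return rng.nextInt(1000);
        if (type == long.class || type == Long.class) return rng.nextLong();
        if (type == boolean.class || type == Boolean.class) return rng.nextBoolean();
        if (type == double.class || type == Double.class) return rng.nextDouble();
        if (type == String.class) return "s" + rng.nextInt(1000);
        throw new IllegalArgumentException("No generator for " + type.getName());
    }

    /** Builds one argument list that matches the method's parameter types. */
    static Object[] argumentsFor(Method method, long seed) {
        Random rng = new Random(seed);
        Class<?>[] types = method.getParameterTypes();
        Object[] arguments = new Object[types.length];
        for (int i = 0; i < types.length; i++) {
            arguments[i] = valueFor(types[i], rng);
        }
        return arguments;
    }

    /** Looks up a declared method by name (the first match wins). */
    static Method methodNamed(Class<?> owner, String name) {
        for (Method method : owner.getDeclaredMethods()) {
            if (method.getName().equals(name)) return method;
        }
        throw new IllegalArgumentException("No method named " + name);
    }

    /** Example method under test. */
    static int discountedPrice(int price, boolean isMember) {
        return isMember ? price - price / 10 : price;
    }

    public static void main(String[] args) throws Exception {
        Method target = methodNamed(SignatureDataSketch.class, "discountedPrice");
        Object[] generated = argumentsFor(target, 7L);
        Object result = target.invoke(null, generated);
        System.out.println(Arrays.toString(generated) + " -> " + result);
    }
}
```

Because the generator is seeded, the same seed yields the same arguments on every run, which keeps failing tests reproducible.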

External Partners

The project will be executed in cooperation with bankmark UG (haftungsbeschränkt) and potentially additional partners.


This is a software engineering project; participants need some experience in software engineering and with at least one programming language, preferably Java. Additional knowledge of software development processes and build tools, such as Maven, is preferable. Students should be comfortable documenting their work with tools like JavaDoc, wikis, and others.


Presenting an Open Source Load Generator for Web Applications

Four B.Sc. IT-Systems Engineering students developed an innovative approach to load testing web applications as part of their final project, which was supervised by the Data Engineering Systems Group. The prototype WALT is based on the generic and highly parallel data generator PDGF, which was provided by project partner bankmark. For the first time, WALT enables software developers and administrators to investigate the performance and resource usage of their web applications under a realistic workload. The project was presented via livestream at the Hasso-Plattner-Institute’s annual “Bachelorpodium” on July 9th, 2020.

Web applications are tested using load generators to determine how many users the provided resources, such as CPU or memory, can serve. Omitting these tests can result in the web application becoming unavailable under excessive demand. Load generators simulate the behaviour of real users by sending requests to the system under test. These requests contain data such as a user’s email address or a keyword for a search. Most existing load generators can only send requests using static data, which in effect simulates a single user over and over. Some load generators are also able to randomise such data, but this results in different data being used for every run, which makes the test runs incomparable. All in all, both existing concepts can only provide results with limited expressiveness.

The load generator WALT, which was developed in the scope of this project, provides a GUI in which the tester can define complex usage scenarios that simulate different use cases of a web application. Testers can also use the GUI to import PDGF-generated data into each test run. For example, when simulating a web shop, one could simulate 100,000 users who sign up, log in, scroll through the special offers, and afterwards search for a different product. Some of the simulated users can then continue the process and buy the product they searched for. During the test run, measured response times and throughputs are evaluated in real time. When a test is finished, the results can be exported and compared to the results of a previous run. When a test is repeated, PDGF can re-generate the same data, and thus WALT simulates the same users again.
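The repeatability described above can be illustrated with a small sketch. It does not show how PDGF or WALT work internally; it only assumes that each simulated user's data is derived from a fixed run seed and the user's index, so that repeating a run with the same seed yields the same users. The names RepeatableUsersSketch, emailFor, and emails, as well as the seed formula, are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: not the PDGF or WALT implementation.
final class RepeatableUsersSketch {

    /** Derives a user's email address purely from the run seed and the user's index. */
    static String emailFor(long runSeed, int userIndex) {
        // A fresh generator per user, seeded from (runSeed, userIndex), means each
        // user can be computed independently of all others.
        Random rng = new Random(runSeed * 1_000_003L + userIndex);
        return "user" + userIndex + "-" + rng.nextInt(100_000) + "@example.com";
    }

    /** Generates the email addresses of the first userCount users of a run. */
    static List<String> emails(long runSeed, int userCount) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < userCount; i++) {
            result.add(emailFor(runSeed, i));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> firstRun = emails(2020L, 3);
        List<String> secondRun = emails(2020L, 3);
        System.out.println(firstRun);
        System.out.println("identical on repeat: " + firstRun.equals(secondRun));
    }
}
```

Unlike static data, the generated users differ from one another; unlike unseeded random data, they are identical across runs, so the results of two runs remain comparable.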

Despite its high complexity, WALT was designed to be as efficient as existing load generators for web applications, so that many users can be simulated with a limited amount of resources. During the project, different technologies were evaluated, and the implementation uses the most efficient one (Kotlin coroutines). In addition, OpenJDK’s prototype Loom was evaluated as a future alternative, and valuable feedback was contributed back to its developers. Experiments show that WALT meets its performance goals and performs better than three commonly used load generators.

WALT is available to the public for use and contribution under an Open Source license.

The project was supervised by Prof. Dr. Tilmann Rabl and Lawrence Benson of the Data Engineering Systems Group and Christoph Brücke-Wendorff of bankmark.