Data Profiling:
Data profiling centers on solving computationally complex tasks, primarily the discovery of metadata in datasets that are often many gigabytes in size; algorithms developed for this purpose therefore need to be efficient and robust. Because data profiling offers such a plethora of challenging, yet unsolved tasks, I have chosen it as my primary research area. I am particularly interested in the discovery of data dependencies, such as inclusion dependencies, unique column combinations, functional dependencies, order dependencies, matching dependencies, and many more.
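To make the notion of a dependency concrete, here is a minimal sketch of how a single candidate functional dependency could be validated on a small table; the column names and data are invented for illustration, and actual discovery algorithms go far beyond this, efficiently enumerating and pruning the entire candidate space rather than naively checking one dependency at a time.

```python
# Toy validation of a candidate functional dependency lhs -> rhs on a table
# given as a list of row dictionaries (illustrative data, not a real dataset).

def fd_holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[col] for col in lhs)
        value = tuple(row[col] for col in rhs)
        if key in seen and seen[key] != value:
            return False  # two rows agree on lhs but differ on rhs
        seen[key] = value
    return True

rows = [
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "10115", "city": "Berlin"},
]
print(fd_holds(rows, ["zip"], ["city"]))  # True: zip -> city holds here
```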
Data Cleansing:
Data is one of the most important assets in any company, so it is crucial to ensure its quality and reliability. Data cleansing and data profiling are two essential tasks that, if performed correctly and frequently, help to guarantee data fitness. In this area, I am particularly interested in (semi-)automatic duplicate detection methods and normalization techniques, as well as their efficient implementation.
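As a toy illustration of the duplicate detection problem, the sketch below flags record pairs whose string similarity exceeds a threshold; the records and the 0.8 threshold are made-up assumptions, and practical methods avoid this quadratic all-pairs comparison with techniques such as blocking.

```python
# Naive pairwise duplicate detection: compare every record pair with a
# string-similarity measure and flag pairs above a threshold.

from difflib import SequenceMatcher
from itertools import combinations

def find_duplicates(records, threshold=0.8):
    """Return pairs of records whose similarity meets the threshold."""
    pairs = []
    for a, b in combinations(records, 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            pairs.append((a, b))
    return pairs

records = ["Jane Doe, Berlin", "Jane Doe, Berln", "John Smith, Potsdam"]
print(find_duplicates(records))  # [('Jane Doe, Berlin', 'Jane Doe, Berln')]
```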
Parallel and Distributed Systems:
Due to the complexity of many tasks in IT, a clever algorithm alone is often unable to deliver a solution in time. In these cases, parallel and distributed systems are needed. Especially when facing ever larger datasets, i.e., big data, we need to consider technologies such as map-reduce frameworks (e.g., Spark and Flink), actor systems (e.g., Akka), and GPUs (e.g., CUDA and OpenCL) to build scalability into our solutions.
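As a minimal sketch of the map-reduce pattern that frameworks like Spark and Flink generalize to clusters, the following counts words using only Python's standard library: chunks of the input are mapped in parallel worker processes, and the partial counts are then reduced into one result; the data and the two-way chunking are illustrative.

```python
# Map-reduce pattern in miniature: parallel map over data chunks,
# then a sequential reduce that merges the partial results.

from multiprocessing import Pool
from collections import Counter
from functools import reduce

def map_chunk(lines):
    """Map step: count word occurrences in one chunk of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

if __name__ == "__main__":
    data = ["big data big systems", "data profiling", "big ideas"]
    chunks = [data[:2], data[2:]]  # partition the input
    with Pool(2) as pool:
        partials = pool.map(map_chunk, chunks)      # parallel map
    total = reduce(lambda a, b: a + b, partials)    # reduce: merge counts
    print(total.most_common())
```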