HPI Colloquium: Dr. Thorsten Papenbrock "Data Profiling at Scale"

Dr. Thorsten Papenbrock, Hasso Plattner Institute, Potsdam; Chair of Information Systems

7 February 2019

Abstract
According to a CrowdFlower study [1] and common experience, data scientists spend about 80 percent of their time not on data science but on data preparation. In industry, the same is true for IT professionals who aim to integrate, connect, and consolidate business data from third-party sources. A major part of that data preparation effort is spent on understanding the structure of the data and finding correspondences to existing datasets or schemata. This process, i.e., the search for structural patterns and dependencies, is called data profiling, and it involves various metadata discovery tasks of exponential complexity; some of them are even among the hardest tasks in computer science. For this reason, most automatic data profiling algorithms do not scale well with the volume of the data.
In this talk, I will provide an overview of our research in the field of data profiling and discuss the challenges ahead. We developed several algorithms that improved the efficiency of automatic data profiling by many orders of magnitude and published them with a practical tool called Metanome. This tool is used by various research groups and companies all over the world, and we aim to drive its development further. The three main objectives for our future research are metadata interpretation (filtering, ranking, and selection), metadata application (data linkage, cleaning, integration, and query optimization), and metadata search parallelization/distribution (for scalability, robustness, and efficiency).
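
To give a feel for why such metadata discovery tasks are exponential, the following is a minimal Python sketch of one of them: finding minimal unique column combinations (candidate keys) by brute force. It is not the approach used in Metanome or in the algorithms the talk covers; the function names and the toy table are hypothetical and chosen only for illustration. With n columns, the search has to consider up to 2^n - 1 column subsets in the worst case, which is why naive approaches break down on wide tables.

```python
from itertools import combinations


def is_unique(rows, columns):
    """Return True if no two rows share the same values in the given columns."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_unique_column_combinations(rows, num_columns):
    """Naive level-wise search over all column subsets (illustrative only)."""
    minimal = []
    for size in range(1, num_columns + 1):
        for candidate in combinations(range(num_columns), size):
            # A superset of a known unique combination cannot be minimal.
            if any(set(m) <= set(candidate) for m in minimal):
                continue
            if is_unique(rows, candidate):
                minimal.append(candidate)
    return minimal


# Hypothetical toy table with columns: 0 = name, 1 = city, 2 = age.
rows = [
    ("Alice", "Berlin", 30),
    ("Alice", "Potsdam", 30),
    ("Bob", "Berlin", 25),
    ("Bob", "Potsdam", 30),
]

result = minimal_unique_column_combinations(rows, 3)
print(result)
```

In this toy table no single column is unique, and only the combination of name and city identifies every row. Each uniqueness check also scans all rows, so the cost grows with both the number of columns (exponentially) and the number of rows.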

Short CV
Thorsten Papenbrock is a researcher and lecturer at the Hasso Plattner Institute. He received his M.Sc. in IT-Systems Engineering in 2014 and his Ph.D. in Computer Science in 2017. The goal of his research is to create efficient tools that make data accessible. For this purpose, he develops novel algorithms and approaches to profile, cleanse, integrate, link, and structure data. His interests also involve techniques for distributed and parallel computing, database systems, and data analytics. More details about his research and teaching activities can be found online [2].

Host: Prof. Dr. Felix Naumann

Event

07 February 2019
4:00 PM
Hasso Plattner Institute, H.2.57