The Hasso Plattner Institute offers a practically-oriented computer science study program at an internationally recognized institute. This study includes the Germany-wide unique IT-Systems Engineering program and the five master programs Cybersecurity, Data Engineering, Digital Health, IT-Systems Engineering and Software Systems Engineering.

Our researchers at HPI benefit from an inspiring scientific environment as well as a collaborative and inclusive atmosphere. In this environment, they obtain insights and findings that achieve societal impact. Our scientific work is structured within research clusters. In addition, we work together with scientific institutions, companies, and public institutions in numerous research programs worldwide.

The Hasso Plattner Institute in Potsdam is unique on the German academic landscape. The institute's program continues to grow with the support of its founder Hasso Plattner and through international cooperation. Find out more about the founder, events and studies at HPI.

The Hasso Plattner Institute has educational programs for both high school students and working professionals. It operates its own IT learning platform - openHPI - which provides free online courses. The Youth Academy organizes computer science camps and events for high school students. Professionals can take advantage of educational opportunities in the field of Design Thinking at the HPI Academy.

The press area of the Hasso Plattner Institute provides you with the latest press material, news, information on our social media channels and contact details.

Arvid Heise

Data Cleansing and Integration Operators for a Parallel Data Analytics Platform

The data quality of real-world datasets need to be constantly monitored and maintained to allow organizations and individuals to reliably use their data. Especially, data integration projects suffer from poor initial data quality and as a consequence consume more effort and money. Commercial products and research prototypes for data cleansing and integration help users to improve the quality of individual and combined datasets. They can be divided into either standalone systems or database management system (DBMS) extensions. On the one hand, standalone systems do not interact well with DBMS and require time-consuming data imports and exports. On the other hand, DBMS extensions are often limited by the underlying system and do not cover the full set of data cleansing and integration tasks.

We overcome both limitations by implementing a concise set of five data cleansing and integration operators on the parallel data analytics platform Stratosphere. We define the semantics of the operators, present their parallel implementation, and devise optimization techniques for individual operators and combinations thereof. Users specify declarative queries in our query language METEOR with our new operators to improve the data quality of individual datasets or integrate them to larger datasets. By integrating the data cleansing operators into the higher level language layer of Stratosphere, users can easily combine cleansing operators with operators from other domains, such as information extraction, to complex data ows. Through a generic description of the operators, the Stratosphere optimizer reorders operators even from different domains to find better query plans.

As a case study, we reimplemented a part of the large Open Government Data integration project GovWILD with our new operators and show that our queries run significantly faster than the original GovWILD queries, which rely on relational operators. Evaluation reveals that our operators exhibit good scalability on up to 100 cores, so that even larger inputs can be efficiently processed by scaling out to more machines. Finally, our scripts are considerably shorter than the original GovWILD scripts, which results in better maintainability of the scripts.

Ombudsperson

Ombudspersons serve as neutral and qualified advisors in questions of good scientific practice and in suspected cases of scientific misconduct.

As far as possible, they contribute to solution-oriented conflict mediation.

If you have any questions, please contact:

Prof. Dr. Tilmann Rabl

Tel.: +49 (0)331 5509-280
E-Mail: tilmann.rabl(at)hpi.de

Future SOC Lab

The “HPI Future SOC Lab” is a cooperation of the Hasso-Plattner-Institut (HPI) and industrial partners. Its mission is to enable and promote exchange and interaction between the research community and the industrial partners.

Further Information

Research Schools

The HPI Research Schools for "Service-Oriented Systems Engineering" and "Data Science and Engineering" have branches in Cape Town, Haifa, Irvine and Nanjing.

Further Information

Digital Health Cluster

The Digital Health Cluster of the Hasso Plattner Institut (HPI) brings together individuals from health sciences, human sciences, data sciences, digital engineering and society with a shared goal to improve health and wellbeing.

Further Information