Stratosphere

Stratosphere is a joint DFG project conducted by the Technische Universität Berlin, Humboldt Universität Berlin, and the Hasso-Plattner-Institut. It explores how the elasticity of clouds can be exploited to process analytic queries in a massively parallel fashion. Unlike most traditional DBMSs, Stratosphere inherently supports text-based and semi-structured data.

Official Project Site

The sub-projects at HPI focus on data quality improvements of linked open data, efficient and scalable data profiling, and knowledge discovery.

Data Cleansing

We define the declarative data cleansing language Meteor, implement the underlying basic operations, and develop cost estimations for these operations. Furthermore, we provide test data sets and example queries to evaluate the efficiency and effectiveness of the data cleansing process.

Data Profiling

Detecting dependencies in ever-growing amounts of data has a high computational complexity. One way to cope with this complexity is to distribute the computational work among multiple interconnected computers. However, most existing data profiling algorithms are designed to run on a single machine rather than for parallel execution on computer clusters. Therefore, we research distributed modifications of existing algorithms as well as new algorithms that can be executed efficiently on computer clusters and that scale out with the number of cluster nodes.

Knowledge Discovery

Driven by applications such as social media analytics, Web search, advertising, recommendation, mobile sensing, genomic sequencing, and astronomical observations, the need for scalable learning, mining, and knowledge discovery methods is steadily growing. Often the challenge is to automatically process and analyze terabytes of evolving data. Extracting value (e.g., understanding the underlying structure and making predictions) from such data before it becomes outdated is a major concern. Therefore, the goal is to make such applications scalable using Stratosphere.

Please contact Felix Naumann, Toni Grütze (Knowledge Discovery on Stratosphere), or Sebastian Kruse (Data Profiling on Stratosphere) for further questions.

Former members

Dr. Arvid Heise

Publications

What was Hillary Clinton doing in Katy, Texas? Gruetze, Toni; Krestel, Ralf; Lazaridou, Konstantina; Naumann, Felix (2017).

CohEEL: Coherent and Efficient Named Entity Linking through Random Walks. Gruetze, Toni; Kasneci, Gjergji; Zuo, Zhe; Naumann, Felix in Web Semantics: Science, Services and Agents on the World Wide Web (2016). 37(C) 75–89.

Topic Shifts in StackOverflow: Ask it like Socrates. Gruetze, Toni; Krestel, Ralf; Naumann, Felix (2016). (Vol. 9612) 213–221.

Scaling Out the Discovery of Inclusion Dependencies. Kruse, Sebastian; Papenbrock, Thorsten; Naumann, Felix (2015). 445–454.

Progressive Duplicate Detection. Papenbrock, Thorsten; Heise, Arvid; Naumann, Felix in IEEE Transactions on Knowledge and Data Engineering (TKDE) (2015). 27(5) 1316–1329.

Learning Temporal Tagging Behaviour. Gruetze, Toni; Yao, Gary; Krestel, Ralf (2015). 1333–1338.

SOFA: An Extensible Logical Optimizer for UDF-heavy Data Flows. Rheinländer, Astrid; Heise, Arvid; Hueske, Fabian; Leser, Ulf; Naumann, Felix in Information Systems (2015). 52 96–125.

Estimating the Number and Sizes of Fuzzy-Duplicate Clusters. Heise, Arvid; Kasneci, Gjergji; Naumann, Felix (2014). 959–968.

The Stratosphere Platform for Big Data Analytics. Alexandrov, Alexander; Bergmann, Rico; Ewen, Stephan; Freytag, Johann-Christoph; Hueske, Fabian; Heise, Arvid; Kao, Odej; Leich, Marcus; Leser, Ulf; Markl, Volker; Naumann, Felix; Peters, Mathias; Rheinländer, Astrid; Sax, Matthias J.; Schelter, Sebastian; Höger, Mareike; Tzoumas, Kostas; Warneke, Daniel in The VLDB Journal (2014). 23(6) 939–964.

Versatile optimization of UDF-heavy data flows with SOFA (demo). Rheinländer, Astrid; Beckmann, Martin; Kunkel, Anja; Heise, Arvid; Stoltmann, Thomas; Leser, Ulf (2014). 685–688.

Reach for Gold: An Annealing Standard to Evaluate Duplicate Detection Results. Vogel, Tobias; Heise, Arvid; Draisbach, Uwe; Lange, Dustin; Naumann, Felix in JDIQ (2014). 5(1-2)

Applying Stratosphere for Big Data Analytics. Leich, Marcus; Adamek, Jochen; Schubotz, Moritz; Heise, Arvid; Rheinländer, Astrid; Markl, Volker (2013).

Scalable Discovery of Unique Column Combinations. Heise, Arvid; Quiane-Ruiz, Jorge-Arnulfo; Abedjan, Ziawasch; Jentzsch, Anja; Naumann, Felix (2013).

SOFA: An Extensible Logical Optimizer for UDF-heavy Dataflows. Rheinländer, Astrid; Heise, Arvid; Hueske, Fabian; Leser, Ulf; Naumann, Felix (2013). (Vol. abs/1311.6335)

Meteor/Sopremo: An Extensible Query Language and Operator Model. Heise, Arvid; Rheinländer, Astrid; Leich, Marcus; Leser, Ulf; Naumann, Felix (2012).

GovWILD: Integrating Open Government Data for Transparency (demo). Böhm, Christoph; Freitag, Markus; Heise, Arvid; Lehmann, Claudia; Mascher, Andrina; Naumann, Felix; Hernandez, Mauricio; Ercegovac, Vuk; Haase, Peter (2012).
