Sebastian Kruse

former member

Research Interests

Data profiling
Distributed systems
Map/Reduce frameworks
Query optimization
Cross-platform/polyglot data processing

Projects

Stratosphere II
Data Profiling with Metanome and Metacrate
Rheem (while interning at QCRI)

Teaching

Master's Theses

Estimating Metadata of Query Results using Histograms (Cathleen Ramson, 2014)
Quicker Ways of Doing Fewer Things: Improved Index Structures and Algorithms for Data Profiling (Jakob Zwiener, 2015)
Methods of Denial Constraint Discovery (Tobias Bleifuß, 2016)
Optimizing Cross-Platform Iterations on the Rheem Platform (Jonas Kemper, ongoing)

Seminars

Master Projects

Approximate Data Profiling (SS 15)
Profiling Dynamic Data (WS 16/17)

Bachelor Projects

Data Refinery (WS 15/16-SS 16)

Guest Lectures

Professional Activities

Member of GI (since 2015) and ACM (since 2016)
Reviewed for Information Systems Journal, VLDB Journal, and TKDE
Contributor to Apache Flink, Rheem, Metanome, and Metacrate

Talks

Talk on Rheem (Spark Summit '17): https://spark-summit.org/2017/events/interoperating-a-zoo-of-data-processing-platforms-using-rheem/
Fast Approximate Discovery of Inclusion Dependencies (BTW '17)
Tutorial on Rheem (BOSS '16 in conjunction with VLDB '16)
RDFind: Scalable Conditional Inclusion Dependency Discovery in RDF Datasets (SIGMOD '16)
Estimating Data Integration and Cleaning Effort (EDBT '15)
Scaling Out the Discovery of Inclusion Dependencies (BTW '15)

Publications

2020

Kruse, S., Kaoudi, Z., Quiane-Ruiz, J.-A., Chawla, S., Naumann, F., Contreras-Rojas, B.: RHEEMix in the Data Jungle: A Cost-based Optimizer for Cross-Platform Systems. VLDB Journal. 29, 1287–1310 (2020).

2019

Schirmer, P., Papenbrock, T., Kruse, S., Naumann, F., Hempfing, D., Mayer, T., Neuschäfer-Rube, D.: DynFD: Functional Dependency Discovery in Dynamic Datasets. Proceedings of the International Conference on Extending Database Technology (EDBT). pp. 253–264 (2019).

Kruse, S., Kaoudi, Z., Quiané-Ruiz, J.-A., Chawla, S., Naumann, F., Contreras-Rojas, B.: Optimizing Cross-Platform Data Movement. Proceedings of the International Conference on Data Engineering (ICDE). pp. 1642–1645 (2019).

2018

Kruse, S., Naumann, F.: Efficient Discovery of Approximate Dependencies. Proceedings of the VLDB Endowment. 11, 759–772 (2018).

@article{kruse2018efficient,
  abstract = {Functional dependencies (FDs) and unique column combinations (UCCs) form a valuable ingredient for many data management tasks, such as data cleaning, schema recovery, and query optimization. Because these dependencies are unknown in most scenarios, their automatic discovery has been well researched. However, existing methods mostly discover only exact dependencies, i.e., those without violations. Real-world dependencies, in contrast, are frequently approximate due to data exceptions, ambiguities, or data errors. This relaxation to approximate dependencies renders their discovery an even harder task than the already challenging exact dependency discovery. To this end, we propose the novel and highly efficient algorithm Pyro to discover both approximate FDs and approximate UCCs. Pyro combines a separate-and-conquer search strategy with sampling-based guidance that quickly detects dependency candidates and verifies them. In our broad experimental evaluation, Pyro outperforms existing discovery algorithms by a factor of up to 33, scales to larger datasets, and at the same time requires the least main memory. --------------------- Errata / Corrigendum for Efficient Discovery of Approximate Dependencies Sebastian Kruse and Felix Naumann Proceedings of the VLDB Endowment 11 (7), 759-772 Readers of the paper have pointed out a few minor errors, which we document here to ease the understanding of the algorithm. Erratum 1) In Section 5.1, the PLI for “Last name” should read 1,4, 3,5. Erratum 2) In Section 5.3, Example 4, the tuple pairs (t1, t3), (t1, t5), and (t2, t3) should yield the agree set sample AS = (, 1), (First_name, Town, 1), (ZIP, 1)). Erratum 3) In Section 5.3, the example AUCC error of the attribute combination A1...An should be 0.0099.},
  author = {Kruse, Sebastian and Naumann, Felix},
  journal = {Proceedings of the VLDB Endowment},
  keywords = {dependency_discovery data_profiling isg},
  note = {See abstract for errata},
  number = 7,
  pages = {759-772},
  title = {Efficient Discovery of Approximate Dependencies},
  volume = 11,
  year = 2018
}

Agrawal, D., Chawla, S., Kaoudi, Z., Kruse, S., Quiané-Ruiz, J.A., Contreras-Rojas, B., Elmagarmid, A., Idris, Y., Lucas, J., Mansour, E., Ouzzani, M., Papotti, P., Tang, N., Thirumuruganathan, S., Troudi, A.: RHEEM: Enabling Cross-Platform Data Processing - May The Big Data Be With You! -. Proceedings of the VLDB Endowment (PVLDB). 11 (2018).

@article{agrawal2018rheem,
  abstract = {Solving business problems increasingly requires going beyond the limits of a single data processing platform (platform for short), such as Hadoop or a DBMS. As a result, organizations typically perform tedious and costly tasks to juggle their code and data across different platforms. Addressing this pain and achieving automatic cross-platform data processing is quite challenging: finding the most efficient platform for a given task requires quite good expertise for all the available platforms. We present Rheem, a general-purpose cross-platform data processing system that decouples applications from the underlying platforms. It not only determines the best platform to run an incoming task, but also splits the task into subtasks and assigns each subtask to a specific platform to minimize the overall cost (e.g., runtime or monetary cost). It features (i) a robust interface to easily compose data analytic tasks; (ii) a novel cost-based optimizer able to find the most efficient platform in almost all cases; and (iii) an executor to efficiently orchestrate tasks over different platforms. As a result, it allows users to focus on the business logic of their applications rather than on the mechanics of how to compose and execute them. Using different real-world applications with Rheem, we demonstrate how cross-platform data processing can accelerate performance by more than one order of magnitude compared to single-platform data processing.},
  author = {Agrawal, Divy and Chawla, Sanjay and Kaoudi, Zoi and Kruse, Sebastian and Quiané-Ruiz, Jorge Arnulfo and Contreras-Rojas, Bertty and Elmagarmid, Ahmed and Idris, Yasser and Lucas, Ji and Mansour, Essam and Ouzzani, Mourad and Papotti, Paolo and Tang, Nan and Thirumuruganathan, Saravanan and Troudi, Anis},
  journal = {Proceedings of the VLDB Endowment (PVLDB)},
  keywords = {rheem myown isg},
  number = 11,
  title = {RHEEM: Enabling Cross-Platform Data Processing - May The Big Data Be With You! -},
  volume = 11,
  year = 2018
}

2017

Bleifuß, T., Kruse, S., Naumann, F.: Efficient Denial Constraint Discovery with Hydra. Proceedings of the VLDB Endowment (PVLDB). 11, 311–323 (2017).

Kruse, S., Papenbrock, T., Dullweber, C., Finke, M., Hegner, M., Zabel, M., Zöllner, C., Naumann, F.: Fast Approximate Discovery of Inclusion Dependencies. Proceedings of the conference on Database Systems for Business, Technology, and Web (BTW). pp. 207–226 (2017).

Kruse, S., Hahn, D., Walter, M., Naumann, F.: Metacrate: Organize and Analyze Millions of Data Profiles. Proceedings of the International Conference on Information and Knowledge Management (CIKM). pp. 2483–2486. ACM (2017).

2016

Bleifuß, T., Bülow, S., Frohnhofen, J., Risch, J., Wiese, G., Kruse, S., Papenbrock, T., Naumann, F.: Approximate Discovery of Functional Dependencies for Large Datasets. Proceedings of the International Conference on Information and Knowledge Management (CIKM). pp. 1803–1812. ACM, New York, NY, USA (2016).

Kruse, S., Papenbrock, T., Harmouch, H., Naumann, F.: Data Anamnesis: Admitting Raw Data into an Organization. IEEE Data Engineering Bulletin. 39, 8–20 (2016).

Kruse, S., Jentzsch, A., Papenbrock, T., Kaoudi, Z., Quiane-Ruiz, J.-A., Naumann, F.: RDFind: Scalable Conditional Inclusion Dependency Discovery in RDF Datasets. Proceedings of the International Conference on Management of Data (SIGMOD). pp. 953–967. ACM, New York, NY, USA (2016).

Agrawal, D., Ba, L., Berti-Equille, L., Chawla, S., Elmagarmid, A., Hammady, H., Idris, Y., Kaoudi, Z., Khayyat, Z., Kruse, S., Ouzzani, M., Papotti, P., Quiané-Ruiz, J.-A., Tang, N., Zaki, M.J.: Rheem: Enabling Multi-Platform Task Execution (demo). Proceedings of the ACM SIGMOD conference (SIGMOD) (2016).

2015

Papenbrock, T., Kruse, S., Quiane-Ruiz, J.-A., Naumann, F.: Divide & Conquer-based Inclusion Dependency Discovery. Proceedings of the VLDB Endowment. 8, 774–785 (2015).

Kruse, S., Papotti, P., Naumann, F.: Estimating Data Integration and Cleaning Effort. Proceedings of the International Conference on Extending Database Technology (EDBT) (2015).

Kruse, S., Papenbrock, T., Naumann, F.: Scaling Out the Discovery of Inclusion Dependencies. Proceedings of the conference on Database Systems for Business, Technology, and Web (BTW). pp. 445–454 (2015).

2014

Meyer, A., Pufahl, L., Batoulis, K., Kruse, S., Lindhauer, T., Stoff, T., Fahland, D., Weske, M.: Data Perspective in Process Choreographies: Modeling and Execution. 26th International Conference on Advanced Information Systems Engineering, Thessaloniki, Greece (2014).
