Hasso-Plattner-Institut
  
Prof. Dr. Felix Naumann
  
 

Sebastian Kruse

Research Assistant at Information Systems Group

Contact

Hasso-Plattner-Institut für Softwaresystemtechnik
Prof.-Dr.-Helmert-Straße 2-3
D-14482 Potsdam, Germany

Phone: +49 331 5509 240
Fax: +49 331 5509 287
Room: 2-01.2, Building E (formerly "Hinterer Neubau")
Email: Sebastian Kruse

Research Interests

  • Data profiling
  • Distributed systems
  • Map/Reduce frameworks
  • Query optimization
  • Cross-platform/polyglot data processing

Projects

Teaching

Master's Theses

  • Estimating Metadata of Query Results using Histograms (Cathleen Ramson, 2014)
  • Quicker Ways of Doing Fewer Things: Improved Index Structures and Algorithms for Data Profiling (Jakob Zwiener, 2015)
  • Methods of Denial Constraint Discovery (Tobias Bleifuß, 2016)

Seminars

Master Projects

  • Approximate Data Profiling (SS 15)

Bachelor Projects

Professional Activities

  • Member of GI (since 2015) and ACM (since 2016)
  • Reviewer for Information Systems Journal
  • Contributor to Apache Flink

Publications

RDFind: Scalable Conditional Inclusion Dependency Discovery in RDF Datasets

Sebastian Kruse, Anja Jentzsch, Thorsten Papenbrock, Zoi Kaoudi, Jorge-Arnulfo Quiane-Ruiz, Felix Naumann
In Proceedings of the ACM SIGMOD conference (SIGMOD), 2016

Abstract:

Inclusion dependencies (INDs) form an important integrity constraint on relational databases, supporting data management tasks such as join path discovery and query optimization. Conditional inclusion dependencies (CINDs), which define including and included data in terms of conditions, allow these capabilities to be transferred to RDF data. However, CIND discovery is computationally much more complex than IND discovery, and the number of CINDs even on small RDF datasets is intractable. To cope with both problems, we first introduce the notion of pertinent CINDs with an adjustable relevance criterion to filter and rank CINDs based on their extent and implications among each other. Second, we present RDFind, a distributed system to efficiently discover all pertinent CINDs in RDF data. RDFind employs a lazy pruning strategy to drastically reduce the CIND search space. Also, its exhaustive parallelization strategy and robust data structures make it highly scalable. In our experimental evaluation, we show that RDFind is up to 419 times faster than the state of the art, while considering a more general class of CINDs. Furthermore, it is capable of processing a very large dataset of billions of triples, which was entirely infeasible before.
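
To make the notion of a conditional inclusion dependency on RDF data more concrete, the following is a minimal sketch of a naive check on a handful of triples. It is not RDFind's algorithm, which the abstract describes only at a high level (lazy pruning, distributed execution). The sketch assumes that each side of the dependency projects one triple field (s, p, or o) and restricts the triples by equality conditions on the other fields. The names `select` and `cind_holds` and the sample triples are hypothetical and chosen for illustration only.

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)
FIELDS = {"s": 0, "p": 1, "o": 2}

# One side of a dependency: (projected field, equality conditions on other fields).
# This shape is an assumption made for this sketch.
Side = Tuple[str, Dict[str, str]]


def select(triples: Iterable[Triple], project: str, condition: Dict[str, str]) -> Set[str]:
    """Collect the values of field `project` over all triples that satisfy `condition`."""
    return {
        t[FIELDS[project]]
        for t in triples
        if all(t[FIELDS[field]] == value for field, value in condition.items())
    }


def cind_holds(triples: Iterable[Triple], dependent: Side, referenced: Side) -> bool:
    """Naive check: every value selected by `dependent` is also selected by `referenced`."""
    triples = list(triples)  # materialize, since the data is scanned twice
    dep_values = select(triples, *dependent)
    ref_values = select(triples, *referenced)
    return dep_values <= ref_values


# Hypothetical sample data.
triples = [
    ("patrick", "rdf:type", "gradStudent"),
    ("mike", "rdf:type", "gradStudent"),
    ("john", "rdf:type", "professor"),
    ("patrick", "undergradFrom", "hpi"),
    ("mike", "undergradFrom", "cmu"),
    ("tim", "undergradFrom", "mit"),
]

grad_students = ("s", {"p": "rdf:type", "o": "gradStudent"})
has_undergrad = ("s", {"p": "undergradFrom"})

print(cind_holds(triples, grad_students, has_undergrad))  # True
print(cind_holds(triples, has_undergrad, grad_students))  # False
```

In this toy data, every graduate student also appears as the subject of an `undergradFrom` triple, so the first dependency holds. The reverse does not, because `tim` has an undergraduate degree but is not typed as a graduate student. Checking a single candidate this way is cheap; the difficulty the abstract points to lies in the very large number of candidate conditions.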

Keywords:

profiling, inclusion dependencies, rdfind, hpi

BibTeX file

@inproceedings{kruse2016rdfind,
author = { Sebastian Kruse and Anja Jentzsch and Thorsten Papenbrock and Zoi Kaoudi and Jorge-Arnulfo Quiane-Ruiz and Felix Naumann },
title = { RDFind: Scalable Conditional Inclusion Dependency Discovery in RDF Datasets },
year = { 2016 },
abstract = { Inclusion dependencies (INDs) form an important integrity constraint on relational databases, supporting data management tasks such as join path discovery and query optimization. Conditional inclusion dependencies (CINDs), which define including and included data in terms of conditions, allow these capabilities to be transferred to RDF data. However, CIND discovery is computationally much more complex than IND discovery, and the number of CINDs even on small RDF datasets is intractable. To cope with both problems, we first introduce the notion of pertinent CINDs with an adjustable relevance criterion to filter and rank CINDs based on their extent and implications among each other. Second, we present RDFind, a distributed system to efficiently discover all pertinent CINDs in RDF data. RDFind employs a lazy pruning strategy to drastically reduce the CIND search space. Also, its exhaustive parallelization strategy and robust data structures make it highly scalable. In our experimental evaluation, we show that RDFind is up to 419 times faster than the state of the art, while considering a more general class of CINDs. Furthermore, it is capable of processing a very large dataset of billions of triples, which was entirely infeasible before. },
keywords = { profiling,inclusion dependencies,rdfind,hpi },
booktitle = { Proceedings of the ACM SIGMOD conference (SIGMOD) },
priority = { 0 }
}

Copyright Notice

This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

last change: Mon, 04 Jul 2016 10:45:56 +0200