Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
  
 

Sebastian Kruse

Research Assistant at Information Systems Group

Contact

Hasso-Plattner-Institut für Softwaresystemtechnik
Prof.-Dr.-Helmert-Straße 2-3
D-14482 Potsdam, Germany

Phone: ++49 331 5509 240
Fax: ++49 331 5509 287
Room: G-3.1.13, Building G, Campus III
Email: Sebastian Kruse

Research Interests

  • Data profiling
  • Distributed systems
  • Map/Reduce frameworks
  • Query optimization
  • Cross-platform/polyglot data processing

Projects

Teaching

Master's Theses

  • Estimating Metadata of Query Results using Histograms (Cathleen Ramson, 2014)
  • Quicker Ways of Doing Fewer Things: Improved Index Structures and Algorithms for Data Profiling (Jakob Zwiener, 2015)
  • Methods of Denial Constraint Discovery (Tobias Bleifuß, 2016)
  • Optimizing Cross-Platform Iterations on 
    the Rheem Platform (Jonas Kemper, ongoing)

Seminars

Master Projects

Bachelor Projects

Guest Lectures

Professional Activities

Talks

Publications

RDFind: Scalable Conditional Inclusion Dependency Discovery in RDF Datasets

Kruse, Sebastian; Jentzsch, Anja; Papenbrock, Thorsten; Kaoudi, Zoi; Quiane-Ruiz, Jorge-Arnulfo; Naumann, Felix in Proceedings of the ACM SIGMOD conference (SIGMOD) 2016 .

Inclusion dependencies (inds) form an important integrity constraint on relational databases, supporting data management tasks, such as join path discovery and query optimization. Conditional inclusion dependencies (cinds), which define including and included data in terms of conditions, allow to transfer these capabilities to rdf data. However, cind discovery is computationally much more complex than ind discovery and the number of cinds even on small rdf datasets is intractable. To cope with both problems, we first introduce the notion of pertinent cinds with an adjustable relevance criterion to filter and rank cinds based on their extent and implications among each other. Second, we present RDFind, a distributed system to efficiently discover all pertinent cinds in rdf data. RDFind employs a lazy pruning strategy to drastically reduce the cind search space. Also, its exhaustive parallelization strategy and robust data structures make it highly scalable. In our experimental evaluation, we show that RDFind is up to 419 times faster than the state-of-the-art, while considering a more general class of cinds. Furthermore, it is capable of processing a very large dataset of billions of triples, which was entirely infeasible before.
RDFind-_Scalable_Conditional_Inclusion_Dependency_Discovery_in_RDF_Datasets.pdf
Further Information
Tags hpi inclusion_dependencies isg profiling rdfind
BibTeX