Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
  
 

Sebastian Kruse

former member

Research Interests

  • Data profiling
  • Distributed systems
  • Map/Reduce frameworks
  • Query optimization
  • Cross-platform/polyglot data processing

Projects

Teaching

Master's Theses

  • Estimating Metadata of Query Results using Histograms (Cathleen Ramson, 2014)
  • Quicker Ways of Doing Fewer Things: Improved Index Structures and Algorithms for Data Profiling (Jakob Zwiener, 2015)
  • Methods of Denial Constraint Discovery (Tobias Bleifuß, 2016)
  • Optimizing Cross-Platform Iterations on 
    the Rheem Platform (Jonas Kemper, ongoing)

Seminars

Master Projects

Bachelor Projects

Guest Lectures

Professional Activities

Talks

Publications

Efficient Discovery of Approximate Dependencies

Kruse, Sebastian; Naumann, Felix in Proceedings of the VLDB Endowment 2018 . See abstract for errata

Functional dependencies (FDs) and unique column combinations (UCCs) form a valuable ingredient for many data management tasks, such as data cleaning, schema recovery, and query optimization. Because these dependencies are unknown in most scenarios, their automatic discovery has been well researched. However, existing methods mostly discover only exact dependencies, i.e., those without violations. Realworld dependencies, in contrast, are frequently approximate due to data exceptions, ambiguities, or data errors. This relaxation to approximate dependencies renders their discovery an even harder task than the already challenging exact dependency discovery. To this end, we propose the novel and highly efficient algorithm Pyro to discover both approximate FDs and approximate UCCs. Pyro combines a separate-and-conquer search strategy with sampling-based guidance that quickly detects dependency candidates and verifies them. In our broad experimental evaluation, Pyro outperforms existing discovery algorithms by a factor of up to 33, scales to larger datasets, and at the same time requires the least main memory. --------------------- Errata / Corrigendum for Efficient Discovery of Approximate Dependencies Sebastian Kruse and Felix Naumann Proceedings of the VLDB Endowment 11 (7), 759-772 Readers of the paper have pointed out a few minor errors, which we document here to ease the understanding of the algorithm. Erratum 1) In Section 5.1, the PLI for “Last name” should read 1,4, 3,5. Erratum 2) In Section 5.3, Example 4, the tuple pairs (t1, t3), (t1, t5), and (t2, t3) should yield the agree set sample AS = (, 1), (First_name, Town, 1), (ZIP, 1)). Erratum 3) In Section 5.3, the example AUCC error of the attribute combination A1...An should be 0.0099.
Efficient Discovery of Ap... - Download
Further Information
Tags data_profiling  dependency_discovery  isg