Hasso-Plattner-Institut
  
Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
  
 

Sebastian Kruse

Research Assistant at Information Systems Group

Contact

Hasso-Plattner-Institut für Softwaresystemtechnik
Prof.-Dr.-Helmert-Straße 2-3
D-14482 Potsdam, Germany

Phone: ++49 331 5509 240
Fax: ++49 331 5509 287
Room: 2-01.2, Building E (formerly "Hinterer Neubau")
Email: Sebastian Kruse

Research Interests

  • Data profiling
  • Distributed systems
  • Map/Reduce frameworks
  • Query optimization
  • Cross-platform/polyglot data processing

Projects

Teaching

Master's Theses

  • Estimating Metadata of Query Results using Histograms (Cathleen Ramson, 2014)
  • Quicker Ways of Doing Fewer Things: Improved Index Structures and Algorithms for Data Profiling (Jakob Zwiener, 2015)
  • Methods of Denial Constraint Discovery (Tobias Bleifuß, 2016)

Seminars

Master Projects

  • Approximate Data Profiling (SS 15)

Bachelor Projects

Professional Activities

  • Member of GI (since 2015) and ACM (since 2016)
  • Reviewer for Information Systems Journal
  • Contributor to Apache Flink

Publications

Approximate Discovery of Functional Dependencies for Large Datasets

Tobias Bleifuß, Susanne Bülow, Johannes Frohnhofen, Julian Risch, Georg Wiese, Sebastian Kruse, Thorsten Papenbrock, Felix Naumann
In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM), pages 1803-1812, 2016

Abstract:

Functional dependencies (FDs) are an important prerequisite for various data management tasks, such as schema normalization, query optimization, and data cleansing. However, automatic FD discovery entails an exponentially growing search and solution space, so that even today’s fastest FD discovery algorithms are limited to small datasets only, due to long runtimes and high memory consumptions. To overcome this situation, we propose an approximate discovery strategy that sacrifices possibly little result correctness in return for large performance improvements. In particular, we introduce AID-FD, an algorithm that approximately discovers FDs within runtimes up to orders of magnitude faster than state-of-the-art FD discovery algorithms. We evaluate and compare our performance results with a focus on scalability in runtime and memory, and with measures for completeness, correctness, and minimality.

BibTeX file

@inproceedings{bleifuss2016approximate,
author = { Tobias Bleifuß, Susanne Bülow, Johannes Frohnhofen, Julian Risch, Georg Wiese, Sebastian Kruse, Thorsten Papenbrock, Felix Naumann },
title = { Approximate Discovery of Functional Dependencies for Large Datasets },
year = { 2016 },
pages = { 1803-1812 },
month = { 0 },
abstract = { Functional dependencies (FDs) are an important prerequisite for various data management tasks, such as schema normalization, query optimization, and data cleansing. However, automatic FD discovery entails an exponentially growing search and solution space, so that even today’s fastest FD discovery algorithms are limited to small datasets only, due to long runtimes and high memory consumptions. To overcome this situation, we propose an approximate discovery strategy that sacrifices possibly little result correctness in return for large performance improvements. In particular, we introduce AID-FD, an algorithm that approximately discovers FDs within runtimes up to orders of magnitude faster than state-of-the-art FD discovery algorithms. We evaluate and compare our performance results with a focus on scalability in runtime and memory, and with measures for completeness, correctness, and minimality. },
booktitle = { Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM) },
priority = { 0 }
}

Copyright Notice

This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

last change: Thu, 10 Nov 2016 20:29:12 +0100