Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
  
 

Sebastian Kruse

Research Assistant at Information Systems Group

Contact

Hasso-Plattner-Institut für Softwaresystemtechnik
Prof.-Dr.-Helmert-Straße 2-3
D-14482 Potsdam, Germany

Phone: +49 331 5509 240
Fax: +49 331 5509 287
Room: G-3.1.13, Building G, Campus III
Email: Sebastian Kruse

Research Interests

  • Data profiling
  • Distributed systems
  • Map/Reduce frameworks
  • Query optimization
  • Cross-platform/polyglot data processing

Projects

Teaching

Master's Theses

  • Estimating Metadata of Query Results using Histograms (Cathleen Ramson, 2014)
  • Quicker Ways of Doing Fewer Things: Improved Index Structures and Algorithms for Data Profiling (Jakob Zwiener, 2015)
  • Methods of Denial Constraint Discovery (Tobias Bleifuß, 2016)
  • Optimizing Cross-Platform Iterations on the Rheem Platform (Jonas Kemper, ongoing)

Seminars

Master Projects

Bachelor Projects

Guest Lectures

Professional Activities

Talks

Publications

Data Anamnesis: Admitting Raw Data into an Organization

Kruse, Sebastian; Papenbrock, Thorsten; Harmouch, Hazar; Naumann, Felix in IEEE Data Engineering Bulletin 2016.

Today’s internet offers a plethora of openly available datasets, bearing great potential for novel applications and research. Likewise, rich datasets slumber within organizations. However, all too often those datasets are available only as raw dumps and lack proper documentation or even a schema. Data anamnesis is the first step of any effort to work with such datasets: It determines fundamental properties regarding the datasets’ content, structure, and quality to assess their utility and to put them to use appropriately. Detecting such properties is a key concern of the research area of data profiling, which has developed several viable instruments, such as data type recognition and foreign key discovery. In this article, we perform an anamnesis of the MusicBrainz dataset, an openly available and complex discographic database. In particular, we employ data profiling methods to create data summaries and then further analyze those summaries to reverse-engineer the database schema, to understand the data semantics, and to point out tangible schema quality issues. We propose two bottom-up schema quality dimensions, namely conciseness and normality, that measure the fit of the schema with its data, in contrast to a top-down approach that compares a schema with its application requirements.
Tags data_anamnesis data_profiling isg schema_discovery