Hasso-Plattner-Institut
  
Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
  
 

Dr. Ziawasch Abedjan

 

former member

 


 

 

 

 

Please visit my new page at MIT CSAIL for the most recent news.

Research Activities

Topics

  • Data Profiling
  • Data Mining

Projects

Master's Thesis Co-Supervision

  • Benjamin Emde: "Context-Aware Recommendations in Social Networks", 2012
  • Sven Viehmeier: "Incremental Data profiling", 2012/2013
  • Patrick Schulze: "Depth-First Discovery of Functional Dependencies" 2013/2014

Master Project Co-Supervision

  • "Global Relevance Scores for DBpedia Facts", 2012/2013

Teaching

Publications

Scalable Discovery of Unique Column Combinations

Arvid Heise, Jorge-Arnulfo Quiane-Ruiz, Ziawasch Abedjan, Anja Jentzsch, Felix Naumann
In Proceedings of the VLDB Endowment (PVLDB), Hangzhou, China, 2013 Jorge's presentation at VLDB 2014 was awarded the "Excellent Presentation Award".

Abstract:

The discovery of all unique (and non-unique) column combinations in a given dataset is at the core of any data profiling effort. The results are useful for a large number of areas of data management, such as anomaly detection, data integration, data modeling, duplicate detection, indexing, and query optimization. However, discovering all unique and non-unique column combinations is an NP-hard problem, which in principle requires to verify an exponential number of column combinations for uniqueness on all data values. Thus, achieving efficiency and scalability in this context is a tremendous challenge by itself. In this paper, we devise DUCC, a scalable and efficient approach to the problem of finding all unique and non-unique column combinations in big datasets. We first model the problem as a graph coloring problem and analyze the pruning effect of individual combinations. We then present our hybrid column-based pruning technique, which traverses the lattice in a depth-first and random walk combination. This strategy allows DUCC to typically depend on the solution set size and hence to prune large swaths of the lattice. DUCC also incorporates row-based pruning to run uniqueness checks in just few milliseconds. To achieve even higher scalability, DUCC runs on several CPU cores (scale-up) and compute nodes (scale-out) with a very low overhead. We exhaustively evaluate DUCC using three datasets (two real and one synthetic) with several millions rows and hundreds of attributes. We compare DUCC with related work: Gordian and HCA. The results show that DUCC is up to more than 2 orders of magnitude faster than Gordian and HCA (631x faster than Gordian and 398x faster than HCA). Finally, a series of scalability experiments shows the efficiency of DUCC to scale up and out.

BibTeX file

@inproceedings{heise_duxx_2014,
author = { Arvid Heise, Jorge-Arnulfo Quiane-Ruiz, Ziawasch Abedjan, Anja Jentzsch, Felix Naumann },
title = { Scalable Discovery of Unique Column Combinations },
year = { 2013 },
month = { 0 },
abstract = { The discovery of all unique (and non-unique) column combinations in a given dataset is at the core of any data profiling effort. The results are useful for a large number of areas of data management, such as anomaly detection, data integration, data modeling, duplicate detection, indexing, and query optimization. However, discovering all unique and non-unique column combinations is an NP-hard problem, which in principle requires to verify an exponential number of column combinations for uniqueness on all data values. Thus, achieving efficiency and scalability in this context is a tremendous challenge by itself. In this paper, we devise DUCC, a scalable and efficient approach to the problem of finding all unique and non-unique column combinations in big datasets. We first model the problem as a graph coloring problem and analyze the pruning effect of individual combinations. We then present our hybrid column-based pruning technique, which traverses the lattice in a depth-first and random walk combination. This strategy allows DUCC to typically depend on the solution set size and hence to prune large swaths of the lattice. DUCC also incorporates row-based pruning to run uniqueness checks in just few milliseconds. To achieve even higher scalability, DUCC runs on several CPU cores (scale-up) and compute nodes (scale-out) with a very low overhead. We exhaustively evaluate DUCC using three datasets (two real and one synthetic) with several millions rows and hundreds of attributes. We compare DUCC with related work: Gordian and HCA. The results show that DUCC is up to more than 2 orders of magnitude faster than Gordian and HCA (631x faster than Gordian and 398x faster than HCA). Finally, a series of scalability experiments shows the efficiency of DUCC to scale up and out. },
note = { Jorge's presentation at VLDB 2014 was awarded the "Excellent Presentation Award". },
address = { Hangzhou, China },
booktitle = { Proceedings of the VLDB Endowment (PVLDB) },
priority = { 0 }
}

Copyright Notice

last change: Wed, 17 Feb 2016 08:29:47 +0100

Review Activity

  • ACM Transactions on the Web
  • DESWeb 2014