Hasso-Plattner-Institut
  
Prof. Dr. Felix Naumann
  
 

Metanome - Data Profiling

Data profiling comprises a broad range of methods to efficiently analyze a given data set. In a typical scenario, which mirrors the capabilities of commercial data profiling tools, tables of a relational database are scanned to derive metadata, such as data types and value patterns, completeness and uniqueness of columns, keys and foreign keys, and occasionally functional dependencies and association rules. Individual research projects have proposed several additional profiling tasks, such as the discovery of inclusion dependencies or conditional functional dependencies.
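As a minimal sketch of the single-column metadata mentioned above (completeness, uniqueness, and data types), the following Python snippet profiles a small in-memory table. The function name `profile_column`, the sample rows, and the use of `None` for missing values are assumptions made for illustration; they are not part of Metanome or of any commercial profiling tool.

```python
def profile_column(values):
    """Return simple single-column metadata; None marks a missing value."""
    present = [v for v in values if v is not None]
    distinct = set(present)
    return {
        # Share of non-missing values in the column.
        "completeness": len(present) / len(values) if values else 0.0,
        # Number of distinct non-missing values.
        "distinct": len(distinct),
        # A column is unique if it has no missing and no repeated values.
        "unique": len(distinct) == len(values),
        # Names of the Python types observed among the non-missing values.
        "types": sorted({type(v).__name__ for v in present}),
    }


# Hypothetical sample table, one dict per record.
rows = [
    {"id": 1, "city": "Potsdam", "zip": "14482"},
    {"id": 2, "city": "Berlin", "zip": None},
    {"id": 3, "city": "Potsdam", "zip": "14482"},
]

profile = {col: profile_column([r[col] for r in rows]) for col in rows[0]}
```

Here `profile["id"]["unique"]` is true, while the `zip` column is only two-thirds complete and `city` has two distinct values.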

The Metanome project is a joint project of HPI and the Qatar Computing Research Institute (QCRI). Metanome provides a fresh view on data profiling by developing efficient algorithms and integrating them into a common tool, expanding the functionality of data profiling, and addressing performance and scalability issues for Big Data. A vision of the Metanome project appeared in SIGMOD Record as "Data Profiling Revisited", and a demo of the Metanome profiling tool was given at VLDB 2015 as "Data Profiling with Metanome".

Algorithm Research

Active:

Past:

Tool Development

Active:

  • Tanja Bergmann (Backend, Frontend, and Architecture)
  • Vincent Schwarzer (Backend and Architecture)
  • Maxi Fischer (Backend and Frontend)

Past:

  • Moritz Finke (Backend and Architecture)
  • Carl Ambroselli (Frontend)
  • Jakob Zwiener (Backend and Architecture)
  • Claudia Exeler (Frontend)

Projects within Metanome

  • Unique column combination discovery
    As a prerequisite for unique constraints and keys, unique column combinations (UCCs) are a basic piece of metadata for any table. The problem is particularly complex because the number of column combinations grows exponentially with the number of columns. We address the problem with parallelization and pruning strategies.
    This work is in collaboration with QCRI.
  • Inclusion dependency discovery
    As a prerequisite for foreign keys, inclusion dependencies (INDs) can tell us how tables within a schema can be connected. For tables from different data sources, conditional IND discovery is of particular relevance.
    See also the completed Aladin project and publications by Jana Bauckmann et al., in particular our Spider algorithm.
  • Incremental dependency discovery
    We are extending our work on UCC and IND discovery to tables that receive incremental updates. The goal is to avoid a complete re-computation and restrict processing to relevant columns, records, and dependencies.
  • Profiling and Mining RDF data
    The <subject, predicate, object> data model of RDF necessitates new approaches to basic profiling and data mining methods. 
    See also: ProLOD++ demo
  • Functional dependency discovery
    Functional dependencies express relationships between attributes of a database relation and are extensively used in data analysis and database design, especially schema normalization. We contribute to research in this area by evaluating current state-of-the-art algorithms and developing faster and more scalable approaches.
    See also: FD algorithms
  • Order dependency discovery
    Order dependencies (ODs) describe a relationship of order between lists of attributes in a relational table. ODs can help to understand the semantics of datasets and the applications producing them. The existence of an OD in a table can provide hints on which integrity constraints are valid for the domain of the data at hand. Moreover, order dependencies have applications in the field of query optimization by suggesting query rewrites.
    See also: OD algorithms
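To make the dependency types listed above concrete, here is a minimal Python sketch that checks each of them naively on a tiny in-memory table. It is not one of the Metanome algorithms (such as Spider): all function names and sample data are hypothetical, the checks are brute force, and the only pruning shown is skipping supersets of already found UCCs.

```python
from itertools import combinations


def is_unique(rows, cols):
    """UCC check: no two rows agree on all columns in cols."""
    seen = set()
    for r in rows:
        key = tuple(r[c] for c in cols)
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_uccs(rows, columns):
    """Enumerate combinations by size, skipping supersets of known UCCs."""
    found = []
    for k in range(1, len(columns) + 1):
        for combo in combinations(columns, k):
            if any(set(u) <= set(combo) for u in found):
                continue  # a superset of a UCC is unique but not minimal
            if is_unique(rows, combo):
                found.append(combo)
    return found


def fd_holds(rows, lhs, rhs):
    """FD check lhs -> rhs: equal lhs values imply equal rhs values."""
    mapping = {}
    for r in rows:
        key = tuple(r[c] for c in lhs)
        if mapping.setdefault(key, r[rhs]) != r[rhs]:
            return False
    return True


def ind_holds(dep_rows, dep_col, ref_rows, ref_col):
    """Unary IND check: every dependent value occurs in the referenced column."""
    return {r[dep_col] for r in dep_rows} <= {r[ref_col] for r in ref_rows}


def od_holds(rows, lhs, rhs):
    """Unary OD check: ordering by lhs also orders by rhs (pairwise test)."""
    return all(s[rhs] <= t[rhs] for s in rows for t in rows if s[lhs] <= t[lhs])


# Hypothetical sample data.
orders = [
    {"order_id": 1, "customer": "a", "zip": "14482", "city": "Potsdam", "total": 10},
    {"order_id": 2, "customer": "b", "zip": "10115", "city": "Berlin", "total": 20},
    {"order_id": 3, "customer": "a", "zip": "14482", "city": "Potsdam", "total": 20},
]
customers = [{"name": "a"}, {"name": "b"}, {"name": "c"}]

uccs = minimal_uccs(orders, ["order_id", "customer", "zip", "city"])
zip_determines_city = fd_holds(orders, ["zip"], "city")
customer_in_names = ind_holds(orders, "customer", customers, "name")
id_orders_total = od_holds(orders, "order_id", "total")
```

On this sample, `order_id` is the only minimal UCC among the four listed columns, `zip -> city` holds, every `customer` value is contained in `customers.name`, and `order_id` orders `total` but not vice versa. The enumeration in `minimal_uccs` visits up to exponentially many column combinations, which is why parallelization and stronger pruning matter on real tables.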

Teaching Data Profiling

Student projects

  • Master's project "Profiling Dynamic Data" (4 students, winter 16/17)
  • Master's project "Approximate Data Profiling" (10 students, summer 2015)
  • Master's project "Metadata Trawling" (4 students, winter 14/15)
  • Master's project "Joint Data Profiling" (4 students, winter 13/14)
  • Master's project "Piggy-back Profiling" (6 students, winter 13/14)
  • Bachelor's project "ProCSIA: Profiling column stores with IBM's Information Analyzer" (8 students, summer 2011)

Current and past master's theses

  • Please see these links for ongoing and completed master's theses, many of which are in the data profiling area. All theses are available as PDF; just contact Felix Naumann.

Courses

Publications

1.
Thorsten Papenbrock, Felix Naumann
Proceedings of the International Conference on Management of Data (SIGMOD), 2016
2.
Philipp Langer and Felix Naumann
VLDB Journal, vol. 25(2):223-241 2016
3.
Jens Ehrlich, Mandy Roick, Lukas Schulze, Jakob Zwiener, Thorsten Papenbrock, and Felix Naumann
In Extending Database Technology (EDBT), pages 305-316, 2016
4.
Sebastian Kruse, Anja Jentzsch, Thorsten Papenbrock, Zoi Kaoudi, Jorge-Arnulfo Quiane-Ruiz, Felix Naumann
In Proceedings of the ACM SIGMOD conference (SIGMOD), 2016
5.
Ziawasch Abedjan, Lukasz Golab and Felix Naumann
In International Conference on Data Engineering (ICDE), 2016
6.
Thorsten Papenbrock, Sebastian Kruse, Jorge-Arnulfo Quiane-Ruiz, Felix Naumann
Proceedings of the VLDB Endowment, vol. 8(7):774-785 2015
7.
Thorsten Papenbrock, Jens Ehrlich, Jannik Marten, Tommy Neubert, Jan-Peer Rudolph, Martin Schönberg, Jakob Zwiener, Felix Naumann
Proceedings of the VLDB Endowment, vol. 8(10):1082-1093 2015
8.
Ziawasch Abedjan, Lukasz Golab, Felix Naumann
VLDB Journal, vol. 24(4):557-581 2015
9.
Thorsten Papenbrock, Tanja Bergmann, Moritz Finke, Jakob Zwiener, Felix Naumann
Proceedings of the VLDB Endowment, vol. 8(12):1860-1871 2015
10.
Anja Jentzsch, Christian Dullweber, Pierpaolo Troiano, Felix Naumann
In Proceedings of Posters and Demos Session, ISWC2015, Bethlehem, PA, USA, 2015
11.
Arvid Heise, Gjergji Kasneci, Felix Naumann
In Proceedings of the Conference on Information and Knowledge Management (CIKM), pages 959-968, 2014
12.
Ziawasch Abedjan, Patrick Schulze, Felix Naumann
In Proceedings of the International Conference on Information and Knowledge Management (CIKM), Shanghai, China, pages 949-958, 2014
13.
Ziawasch Abedjan, Jorge-Arnulfo Quiane-Ruiz, Felix Naumann
In Proceedings of the IEEE International Conference on Data Engineering (ICDE), Chicago, IL, 2014
14.
Ziawasch Abedjan, Toni Gruetze, Anja Jentzsch, Felix Naumann
In Proceedings of the IEEE International Conference on Data Engineering (ICDE), Demo, Chicago, IL, 2014
15.
Ziawasch Abedjan, Felix Naumann
In Know@LOD Workshop in conjunction with ESWC, Crete, Greece, 2014. Selected for Best Workshop Paper Award.
16.
Benedikt Forchhammer, Anja Jentzsch, Felix Naumann
In Proceedings of the International Workshop on Dataset PROFIling & fEderated Search for Linked Data (PROFILES) in conjunction with ESWC, Heraklion, Greece, 2014. Selected for Best Workshop Paper Award.
17.
Felix Naumann
SIGMOD Record, vol. 42(4):40-49 2013
18.
Ziawasch Abedjan, Felix Naumann
Datenbank-Spektrum (Special Issue on RDF Data Management), vol. 13(2):111–120 2013
19.
Ziawasch Abedjan, Felix Naumann
In Proceedings of the Extended Semantic Web Conference (ESWC), Montpellier, France, 2013
20.
Arvid Heise, Jorge-Arnulfo Quiane-Ruiz, Ziawasch Abedjan, Anja Jentzsch, Felix Naumann
In Proceedings of the VLDB Endowment (PVLDB), Hangzhou, China, 2013. Jorge's presentation at VLDB 2014 was awarded the "Excellent Presentation Award".
21.
Ziawasch Abedjan, Johannes Lorey, Felix Naumann
In Proceedings of the 21st International Conference on Information and Knowledge Management (CIKM), pages 1532-1536, Maui, Hawaii, USA, 2012
22.
Jana Bauckmann, Ziawasch Abedjan, Heiko Müller, Ulf Leser, Felix Naumann
In Proceedings of the International Conference on Information and Knowledge Management (CIKM), Maui, Hawaii, pages 2094-2098, 2012
23.
Christoph Böhm, Gjergji Kasneci, Felix Naumann
In Proceedings of the Conference on Information and Knowledge Management (CIKM), 2012
24.
Toni Gruetze, Christoph Böhm, Felix Naumann
In Proceedings of the 5th Linked Data on the Web (LDOW) Workshop at the 21st International World Wide Web Conference (WWW), Lyon, France, April 2012
hpi.de/fileadmin/user_upload/fachgebiete/naumann/publications/2012/hcm_ldow_www.pdf
25.
Jana Bauckmann, Ziawasch Abedjan, Ulf Leser, Heiko Müller, Felix Naumann
Technical Report 62, Hasso-Plattner-Institut für Softwaresystemtechnik an der Universität Potsdam, 2012 ISBN 978-3-86956-212-4, ISSN 1613-5652
26.
Christoph Böhm, Johannes Lorey, Felix Naumann
Journal of Web Semantics: Science, Services and Agents on the World Wide Web, vol. 9(3):339-345 2011
27.
Ziawasch Abedjan, Felix Naumann
In Proceedings of the International Conference on Information and Knowledge Management (CIKM), Glasgow, UK, 2011
28.
Ziawasch Abedjan, Felix Naumann
In International Workshop on Search & Mining Entity-Relationship Data (SMER), Glasgow, UK, 2011
29.
Johannes Lorey, Ziawasch Abedjan, Felix Naumann, Christoph Böhm
In Billion Triples Challenge (BTC) at the 10th International Semantic Web Conference (ISWC), Koblenz, Germany, 2011. Finalist.
30.
Ziawasch Abedjan, Felix Naumann
Technical Report 51, Hasso-Plattner-Institut für Softwaresystemtechnik an der Universität Potsdam, 2011 ISBN 978-3-86956-148-6, ISSN 1613-5652
31.
Jana Bauckmann, Ulf Leser, Felix Naumann
Technical Report 34, Hasso-Plattner-Institut für Softwaresystemtechnik an der Universität Potsdam, 2010 ISBN 978-3-86956-048-9, ISSN 1613-5652
32.
Christoph Böhm, Philip Groth, Ulf Leser
In Proceedings of the International Semantic Web Conference (ISWC), pages 91-96, 2009
33.
Alexandra Rostin, Oliver Albrecht, Jana Bauckmann, Felix Naumann, Ulf Leser
In Proceedings of the International Workshop on the Web and Databases (WebDB), Providence, RI, 2009