Metanome - Data Profiling

Data profiling comprises a broad range of methods to efficiently analyze a given data set. In a typical scenario, which mirrors the capabilities of commercial data profiling tools, tables of a relational database are scanned to derive metadata, such as data types and value patterns, completeness and uniqueness of columns, keys and foreign keys, and occasionally functional dependencies and association rules. Individual research projects have proposed several additional profiling tasks, such as the discovery of inclusion dependencies or conditional functional dependencies.
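As a rough illustration of the single-column metadata named above (data types, completeness, uniqueness), the following minimal sketch profiles one column of string values. It is not part of Metanome; the function names `profile_column` and `infer_type` are hypothetical, and treating `None` and the empty string as missing values is an assumption.

```python
def infer_type(present):
    """Guess the narrowest type that parses all non-missing string values."""
    if not present:
        return "unknown"
    for name, cast in (("integer", int), ("float", float)):
        try:
            for value in present:
                cast(value)
            return name
        except (TypeError, ValueError):
            continue  # at least one value does not parse; try the next type
    return "string"


def profile_column(values):
    """Return basic metadata for one column given as a list of cell values.

    Assumption: None and "" count as missing values.
    """
    present = [v for v in values if v is not None and v != ""]
    distinct = set(present)
    total = len(values)
    return {
        "rows": total,
        # Share of cells that carry a value.
        "completeness": len(present) / total if total else 0.0,
        "distinct": len(distinct),
        # Unique means no non-missing value occurs twice.
        "is_unique": len(distinct) == len(present),
        "inferred_type": infer_type(present),
    }


# Hypothetical toy column, e.g. as read from a CSV file.
ages = ["31", "45", "", "45"]
profile = profile_column(ages)
```

A real profiler would additionally derive value patterns and histograms and would stream over the data rather than hold a column in memory.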

Metanome is a joint project of HPI and the Qatar Computing Research Institute (QCRI). It provides a fresh view on data profiling by developing and integrating efficient algorithms into a common tool, expanding the functionality of data profiling, and addressing performance and scalability issues for Big Data. A vision of the Metanome project appeared in SIGMOD Record as "Data Profiling Revisited", and a demo of the Metanome profiling tool was given at VLDB 2015 as "Data Profiling with Metanome" (please cite as BibTeX/EndNote/ACM Ref).

Tool and Algorithms

Metanome Tool and Profiling Algorithms

Repeatability

Data

Algorithm Research

Active:

Felix Naumann (Project lead)
Hazar Harmouch (Single Column Profiling)
Thorsten Papenbrock (IND, UCC and FD discovery; Metanome architecture)
Tobias Bleifuß (DC discovery)

Past:

Anja Jentzsch (RDF profiling and IND discovery)
Arvid Heise (UCC discovery)
Fabian Tschirschnitz (IND discovery)
Jens Ehrlich (Conditional UCC discovery)
Jorge-Arnulfo Quiané-Ruiz (@QCRI; UCC and IND discovery)
Maximilian Grundke (CFD discovery)
Moritz Finke (Approximate FD/IND discovery and FD ranking)
Philipp Langer (OD discovery)
Philipp Schirmer (MD discovery)
Sebastian Kruse (IND discovery; Metadata Store)
Tim Draeger (MVD discovery)
Ziawasch Abedjan (UCC discovery)

Tool Development

Active:

Joana Bergsiek (Backend and Frontend)

Past:

Carl Ambroselli (Frontend)
Claudia Exeler (Frontend)
Jakob Zwiener (Backend and Architecture)
Maxi Fischer (Backend and Frontend)
Moritz Finke (Backend and Architecture)
Tanja Bergmann (Backend, Frontend, and Architecture)
Vincent Schwarzer (Backend and Architecture)

Projects within Metanome

Unique column combination discovery
As a prerequisite for unique constraints and keys, UCCs are a basic piece of metadata for any table. The problem is particularly hard because the number of column combinations grows exponentially with the number of columns. We address the problem with parallelization and pruning strategies.
This work is in collaboration with QCRI.
Inclusion dependency discovery
As a prerequisite for foreign keys, INDs can tell us how tables within a schema can be connected. When regarding tables of different data sources, conditional IND discovery is of particular relevance.
See also the completed Aladin project and publications by Jana Bauckmann et al., in particular our Spider algorithm.
Incremental dependency discovery
We are extending our work on UCC and IND discovery to tables that receive incremental updates. The goal is to avoid a complete re-computation and restrict processing to relevant columns, records, and dependencies.
Profiling and Mining RDF data
The <subject, predicate, object> data model of RDF necessitates new approaches to basic profiling and data mining methods.
See also: ProLOD++ demo
Functional dependency discovery
Functional dependencies express relationships between attributes of a database relation and are extensively used in data analysis and database design, especially schema normalization. We contribute to research in this area by evaluating current state-of-the-art algorithms and developing faster and more scalable approaches.
See also: FD algorithms
Order dependency discovery
Order dependencies (ODs) describe a relationship of order between lists of attributes in a relational table. ODs can help to understand the semantics of datasets and the applications producing them. The existence of an OD in a table can provide hints on which integrity constraints are valid for the domain of the data at hand. Moreover, order dependencies have applications in the field of query optimization by suggesting query rewrites.
See also: OD algorithms
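
To make the dependency types described above concrete, here is a minimal sketch that validates a given UCC, FD, unary IND, or OD on a small in-memory table and enumerates minimal UCCs level by level. These are naive checks for illustration only, not the algorithms developed in this project; all function names and the toy data are hypothetical. The OD check assumes the definition that ordering the rows by the left-hand list also orders them by the right-hand list.

```python
from itertools import combinations


def is_unique(rows, columns):
    """UCC check: no two rows agree on all of the given columns."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def holds_fd(rows, lhs, rhs):
    """FD check lhs -> rhs: rows that agree on lhs must agree on rhs."""
    mapping = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = tuple(row[c] for c in rhs)
        if mapping.setdefault(key, value) != value:
            return False
    return True


def holds_ind(dependent_rows, dep_col, referenced_rows, ref_col):
    """Unary IND check: every dependent value occurs in the referenced column."""
    referenced = {row[ref_col] for row in referenced_rows}
    return all(row[dep_col] in referenced for row in dependent_rows)


def holds_od(rows, lhs, rhs):
    """OD check (assumed definition): sorting by lhs also sorts by rhs."""
    keyed = sorted(
        (tuple(r[c] for c in lhs), tuple(r[c] for c in rhs)) for r in rows
    )
    for (x1, y1), (x2, y2) in zip(keyed, keyed[1:]):
        # Ties on lhs must be ties on rhs; otherwise rhs must not decrease.
        if y1 > y2 or (x1 == x2 and y1 != y2):
            return False
    return True


def minimal_uccs(rows, columns):
    """Level-wise search that prunes supersets of already found UCCs."""
    found = []
    for size in range(1, len(columns) + 1):
        for combo in combinations(columns, size):
            if any(set(u) <= set(combo) for u in found):
                continue  # a subset is already unique, so combo is not minimal
            if is_unique(rows, combo):
                found.append(combo)
    return found


# Hypothetical toy tables.
employees = [
    {"id": 1, "zip": "14482", "city": "Potsdam", "salary": 3000, "tax": 600},
    {"id": 2, "zip": "14482", "city": "Potsdam", "salary": 4000, "tax": 900},
    {"id": 3, "zip": "10115", "city": "Berlin", "salary": 5000, "tax": 1300},
]
cities = [{"name": "Potsdam"}, {"name": "Berlin"}]

print(is_unique(employees, ["id"]))                 # True
print(holds_fd(employees, ["zip"], ["city"]))       # True
print(holds_ind(employees, "city", cities, "name")) # True
print(holds_od(employees, ["salary"], ["tax"]))     # True
print(minimal_uccs(employees, ["id", "zip", "city"]))  # [('id',)]
```

Validating one candidate is cheap; the difficulty lies in the number of candidates. The level-wise search above still visits up to 2^n column combinations for n columns in the worst case, which is why pruning and parallelization matter for discovery.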

Teaching Data Profiling

Student projects

Master's project "Profiling Dynamic Data" (4 students, winter 16/17)
Master's project "Approximate Data Profiling" (10 students, summer 2015)
Master's project "Metadata Trawling" (4 students, winter 14/15)
Master's project "Joint Data Profiling" (4 students, winter 13/14)
Master's project "Piggy-back Profiling" (6 students, winter 13/14)
Bachelor's project "ProCSIA: Profiling column stores with IBM's Information Analyzer" (8 students, summer 2011)

Current and past master's theses

Please see these links for ongoing and completed master's theses, many of which are in the data profiling area. All theses are available as PDF; just contact Felix Naumann.

Courses

Master's seminar "Advanced Data Profiling" (winter 17/18)
Master's course "Data Profiling" (summer 17)
PhD course "Data Profiling" at University of Trento (summer 15)
Master's course "Data profiling and data cleansing" (winter 14/15)
Master's course "Data profiling and data cleansing" (summer 13)
Master's seminar "Advanced data profiling" (winter 13/14)
Master's seminar "Linked Data Profiling" (summer 2009)

Publications

A Hybrid Approach for Efficient Unique Column Combination Discovery. Papenbrock, Thorsten; Naumann, Felix (2017). 195–204.

Data-driven Schema Normalization. Papenbrock, Thorsten; Naumann, Felix (2017). 342–353.

A Hybrid Approach to Functional Dependency Discovery. Papenbrock, Thorsten; Naumann, Felix in SIGMOD ’16 (2016). 821–833.

Data Profiling (tutorial). Abedjan, Ziawasch; Golab, Lukasz; Naumann, Felix (2016).

RDFind: Scalable Conditional Inclusion Dependency Discovery in RDF Datasets. Kruse, Sebastian; Jentzsch, Anja; Papenbrock, Thorsten; Kaoudi, Zoi; Quiane-Ruiz, Jorge-Arnulfo; Naumann, Felix in SIGMOD ’16 (2016). 953–967.

Efficient Order Dependency Discovery. Langer, Philipp; Naumann, Felix in VLDB Journal (2016). 25(2) 223–241.

Holistic Data Profiling: Simultaneous Discovery of Various Metadata. Ehrlich, Jens; Roick, Mandy; Schulze, Lukas; Zwiener, Jakob; Papenbrock, Thorsten; Naumann, Felix (2016). 305–316.

Functional Dependency Discovery: An Experimental Evaluation of Seven Algorithms. Papenbrock, Thorsten; Ehrlich, Jens; Marten, Jannik; Neubert, Tommy; Rudolph, Jan-Peer; Schönberg, Martin; Zwiener, Jakob; Naumann, Felix in Proceedings of the VLDB Endowment (2015). 8(10) 1082–1093.

Profiling relational data: a survey. Abedjan, Ziawasch; Golab, Lukasz; Naumann, Felix in VLDB Journal (2015). 24(4) 557–581.

Divide & Conquer-based Inclusion Dependency Discovery. Papenbrock, Thorsten; Kruse, Sebastian; Quiane-Ruiz, Jorge-Arnulfo; Naumann, Felix in Proceedings of the VLDB Endowment (2015). 8(7) 774–785.

Exploring Linked Data Graph Structures. Jentzsch, Anja; Dullweber, Christian; Troiano, Pierpaolo; Naumann, Felix (2015).

Data Profiling with Metanome. Papenbrock, Thorsten; Bergmann, Tanja; Finke, Moritz; Zwiener, Jakob; Naumann, Felix in Proceedings of the VLDB Endowment (2015). 8(12) 1860–1871.

Estimating the Number and Sizes of Fuzzy-Duplicate Clusters. Heise, Arvid; Kasneci, Gjergji; Naumann, Felix (2014). 959–968.

DFD: Efficient Discovery of Functional Dependencies. Abedjan, Ziawasch; Schulze, Patrick; Naumann, Felix (2014). 949–958.

Amending RDF Entities with New Facts. Abedjan, Ziawasch; Naumann, Felix (2014).

LODOP - Multi-Query Optimization for Linked Data Profiling Queries. Forchhammer, Benedikt; Jentzsch, Anja; Naumann, Felix (2014).

Profiling and Mining RDF Data with ProLOD++. Abedjan, Ziawasch; Gruetze, Toni; Jentzsch, Anja; Naumann, Felix (2014).

Detecting Unique Column Combinations on Dynamic Data. Abedjan, Ziawasch; Quiane-Ruiz, Jorge-Arnulfo; Naumann, Felix (2014).

Synonym Analysis for Predicate Expansion. Abedjan, Ziawasch; Naumann, Felix (2013).

Improving RDF Data through Association Rule Mining. Abedjan, Ziawasch; Naumann, Felix in Datenbank-Spektrum (Special Issue on RDF Data Management) (2013). 13(2) 111–120.

Scalable Discovery of Unique Column Combinations. Heise, Arvid; Quiane-Ruiz, Jorge-Arnulfo; Abedjan, Ziawasch; Jentzsch, Anja; Naumann, Felix (2013).

Data Profiling Revisited. Naumann, Felix in SIGMOD Record (2013). 42(4) 40–49.

Holistic and Scalable Ontology Alignment for Linked Open Data. Gruetze, Toni; Böhm, Christoph; Naumann, Felix (2012).

Discovering Conditional Inclusion Dependencies. Bauckmann, Jana; Abedjan, Ziawasch; Müller, Heiko; Leser, Ulf; Naumann, Felix (2012). 2094–2098.

Latent Topics in Graph-Structured Data. Böhm, Christoph; Kasneci, Gjergji; Naumann, Felix (2012).

Reconciling Ontologies and the Web of Data. Abedjan, Ziawasch; Lorey, Johannes; Naumann, Felix (2012). 1532–1536.

Covering or complete? : discovering conditional inclusion dependencies. Technical Report (62), Bauckmann, Jana; Abedjan, Ziawasch; Leser, Ulf; Müller, Heiko; Naumann, Felix (2012).

Advancing the Discovery of Unique Column Combinations. Abedjan, Ziawasch; Naumann, Felix (2011).

Creating voiD Descriptions for Web-scale Data. Böhm, Christoph; Lorey, Johannes; Naumann, Felix in Journal of Web Semantics: Science, Services and Agents on the World Wide Web (2011). 9(3) 339–345.

Context and Target Configurations for Mining RDF Data. Abedjan, Ziawasch; Naumann, Felix (2011).

Advancing the Discovery of Unique Column Combinations. Technical Report (51), Abedjan, Ziawasch; Naumann, Felix (2011).

RDF Ontology (Re-)Engineering through Large-scale Data Mining. Lorey, Johannes; Abedjan, Ziawasch; Naumann, Felix; Böhm, Christoph (2011).

Efficient and Exact Computation of Inclusion Dependencies for Data Integration. Technical Report (34), Bauckmann, Jana; Leser, Ulf; Naumann, Felix (2010).

Graph-Based Ontology Construction from Heterogeneous Evidences. Böhm, Christoph; Groth, Philip; Leser, Ulf (2009). 91–96.

A Machine Learning Approach to Foreign Key Discovery. Rostin, Alexandra; Albrecht, Oliver; Bauckmann, Jana; Naumann, Felix; Leser, Ulf (2009).
