Metacrate: Organize and Analyze Millions of Data Profiles
Sebastian Kruse, David Hahn, Marius Walter, Felix Naumann
Our demo proposal for Metacrate has been accepted for presentation at the International Conference on Information and Knowledge Management (CIKM), which takes place from November 6th to 10th, 2017, in Singapore. Metacrate is a system to store data profiling results, integrate them, and analyze them to serve several use cases, such as data anamnesis, data cleaning, or data integration. As such, Metacrate complements our data profiling tool Metanome, which produces data profiling results.
Databases are one of the great success stories in IT. However, they have continuously grown in complexity, which hampers their operation, maintenance, and upgrades. To cope with this complexity, sophisticated methods for schema summarization, data cleaning, information integration, and many other tasks have been devised. These methods usually rely on data profiles, such as data statistics, signatures, and integrity constraints. Such data profiles are often extracted by automatic algorithms, which entails various problems: the profiles can be unfiltered and huge in volume; different profile types require different complex data structures; and the various profile types are not integrated with each other.
We introduce Metacrate, a system to store, organize, and analyze data profiles of relational databases that itself follows the proven design of databases. In particular, we (i) propose a logical and a physical data model to store all kinds of data profiles in a scalable fashion; (ii) describe an analytics layer to query, integrate, and analyze the profiles efficiently; and (iii) implement, on top of them, a library of established algorithms to serve use cases such as schema discovery, database refactoring, and data cleaning.
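To make the idea of storing heterogeneous profile types together and then integrating them more concrete, here is a minimal Python sketch. It is not Metacrate's actual data model or API: the names `ColumnStatistics`, `InclusionDependency`, `ProfileStore`, and `foreign_key_candidates` are hypothetical and exist only for illustration, and the in-memory store stands in for whatever scalable physical model the system uses. The sketch assumes that a foreign-key candidate is an inclusion dependency whose referenced column is unique and free of nulls, which is a common heuristic rather than a method taken from the text above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnStatistics:
    """Hypothetical single-column profile: basic statistics of one column."""
    column: str           # qualified column name, e.g. "orders.customer_id"
    row_count: int
    distinct_values: int
    null_count: int


@dataclass(frozen=True)
class InclusionDependency:
    """Hypothetical profile: all values of `dependent` occur in `referenced`."""
    dependent: str
    referenced: str


class ProfileStore:
    """Hypothetical in-memory store that keeps one collection per profile type."""

    def __init__(self):
        self._collections = defaultdict(list)

    def add(self, profile):
        # Profiles of the same type share a collection, so each type can
        # keep its own structure while living in the same store.
        self._collections[type(profile)].append(profile)

    def query(self, profile_type, predicate=lambda profile: True):
        # Return all profiles of the given type that satisfy the predicate.
        return [p for p in self._collections[profile_type] if predicate(p)]


def foreign_key_candidates(store):
    """Integrate two profile types: keep the inclusion dependencies whose
    referenced column is unique and contains no nulls (assumed heuristic)."""
    unique_columns = {
        stats.column
        for stats in store.query(ColumnStatistics)
        if stats.null_count == 0 and stats.distinct_values == stats.row_count
    }
    return [
        ind
        for ind in store.query(InclusionDependency)
        if ind.referenced in unique_columns
    ]
```

The point of the sketch is the integration step: neither the statistics nor the inclusion dependencies alone suggest a foreign key, but a query that combines both profile types does.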