Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Data profiling, the task of determining metadata about a given dataset, is an important and frequent activity for IT professionals and researchers and is necessary for many use cases. It encompasses a wide array of methods that examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, such as correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand.
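The simpler, single-column statistics mentioned above can be illustrated with a minimal Python sketch. It is not taken from any of the projects listed here; the function names (`profile_column`, `value_pattern`, `infer_type`), the choice of null tokens, and the pattern alphabet (digits become `9`, letters become `A`) are assumptions made purely for illustration.

```python
from collections import Counter
import re


def value_pattern(value):
    """Generalize a string: every digit becomes '9', every ASCII letter 'A'."""
    pattern = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", pattern)


def infer_type(values):
    """Return the narrowest of 'integer', 'float', 'string' that fits all values."""
    if not values:
        return "unknown"

    def fits(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if fits(int):
        return "integer"
    if fits(float):
        return "float"
    return "string"


def profile_column(values, null_tokens=("", "NULL")):
    """Compute basic statistics for one column given as a list of strings/None.

    Assumption: None and the strings in null_tokens count as null values.
    """
    non_null = [v for v in values if v is not None and v not in null_tokens]
    patterns = Counter(value_pattern(v) for v in non_null)
    return {
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "type": infer_type(non_null),
        "top_patterns": patterns.most_common(3),
    }


# Hypothetical example column with two nulls and one differently formatted value
profile = profile_column(["14482", "10115", None, "14482", "D-14482", ""])
print(profile)
```

For this example column the sketch reports 2 null values, 3 distinct values, the type `string` (because of the value `D-14482`), and `99999` as the most frequent pattern. Real profiling tools typically use richer type systems and pattern languages than this.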

In our research projects we develop efficient and scalable dependency detection algorithms, both for relational data in the Metanome project and for RDF data in the ProLOD++ project.
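To make the three kinds of dependencies concrete, the following Python sketch checks whether a single given candidate holds on a small in-memory table. It is a naive validation under stated assumptions, not one of the discovery algorithms developed in these projects: efficient algorithms must search an exponentially large space of candidates, whereas this code only tests one candidate at a time. The function names, the row representation (a list of dictionaries), and the sample data are hypothetical, and null semantics are ignored.

```python
def is_unique(rows, columns):
    """Unique column combination: no two rows agree on all given columns."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def holds_fd(rows, lhs, rhs):
    """Functional dependency lhs -> rhs: equal lhs values imply equal rhs value."""
    mapping = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        # setdefault stores the first rhs value seen for this lhs key
        if mapping.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def holds_ind(dependent_rows, dep_col, referenced_rows, ref_col):
    """Unary inclusion dependency: every dep_col value also occurs in ref_col."""
    referenced = {row[ref_col] for row in referenced_rows}
    return all(row[dep_col] in referenced for row in dependent_rows)


# Hypothetical sample tables
people = [
    {"id": 1, "name": "Ada", "zip": "14482", "city": "Potsdam"},
    {"id": 2, "name": "Bob", "zip": "10115", "city": "Berlin"},
    {"id": 3, "name": "Ada", "zip": "14482", "city": "Potsdam"},
]
cities = [
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "10115", "city": "Berlin"},
    {"zip": "80331", "city": "München"},
]

print(is_unique(people, ["id"]))                  # True
print(is_unique(people, ["name", "zip"]))         # False: rows 1 and 3 agree
print(holds_fd(people, ["zip"], "city"))          # True
print(holds_fd(people, ["city"], "id"))           # False: Potsdam maps to 1 and 3
print(holds_ind(people, "zip", cities, "zip"))    # True: a foreign-key candidate
print(holds_ind(cities, "zip", people, "zip"))    # False: 80331 is missing
```

Each check needs only a single pass over the data. The hard part of dependency detection is deciding which of the many possible column combinations to check at all.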

Current projects

  • Metanome: A framework and application for efficient profiling algorithms on large relational datasets
  • Janus: Project on data change exploration

Completed projects

  • ProLOD and ProLOD++: An interactive application to profile RDF data
  • MetaCrate: A database for data profiles
  • Mining RDF data: Synonym discovery, ontology alignment, and data enrichment
  • Stratosphere data profiling: Distributed data profiling algorithms for Stratosphere and other distributed processing platforms
  • SPIDER: An efficient algorithm to detect inclusion dependencies and foreign keys
  • BTC: The results of our two participations in the Billion Triples Challenges
  • XStruct: Automatically extract schemata from XML documents