Demo at ISWC 2015
Exploring Linked Data Graph Structures
Abstract. The true value of Linked Data becomes apparent when datasets are analyzed and understood at the basic level of data types, constraints, value patterns, etc. Such data profiling is especially challenging for RDF data, the underlying data model of the Web of Data. In particular, graph analysis can be used to gain more insight into the data, induce schemas, or build indices. We present ProLOD++, a tool for various profiling and mining tasks, and in particular its recent extension GraphLOD, which offers RDF graph analysis features. ProLOD++ features many interactive profiling results specific to open data, such as schema discovery for user-generated attributes, association rule discovery to uncover synonymous predicates, and key discovery along ontology hierarchies. GraphLOD enhances it with subgraph pattern mining, node degree distribution, component visualization and analysis, and more.
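To make two of the graph features named above concrete, the following is a minimal sketch of node degree distribution and connected-component analysis over RDF-style triples. It is not GraphLOD's implementation: the toy triples, the `ex:` identifiers, and all function names are hypothetical, and the sketch assumes the graph is treated as undirected with predicates ignored.

```python
from collections import Counter, defaultdict

# Hypothetical toy triples (subject, predicate, object); not from a real dataset.
TRIPLES = [
    ("ex:Berlin", "ex:capitalOf", "ex:Germany"),
    ("ex:Potsdam", "ex:locatedIn", "ex:Germany"),
    ("ex:Berlin", "ex:neighbourOf", "ex:Potsdam"),
    ("ex:Paris", "ex:capitalOf", "ex:France"),
]


def build_adjacency(triples):
    """Build an undirected adjacency map; predicates are ignored (assumption)."""
    adjacency = defaultdict(set)
    for subject, _predicate, obj in triples:
        adjacency[subject].add(obj)
        adjacency[obj].add(subject)
    return adjacency


def degree_distribution(adjacency):
    """Map each node degree to the number of nodes that have that degree."""
    return Counter(len(neighbours) for neighbours in adjacency.values())


def connected_components(adjacency):
    """Return the connected components as a list of node sets (iterative DFS)."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components


if __name__ == "__main__":
    graph = build_adjacency(TRIPLES)
    print(dict(degree_distribution(graph)))
    print([sorted(c) for c in connected_components(graph)])
```

On real data, literals would typically be excluded from the node set and edge direction might be kept; both are design choices that the abstract does not specify.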
Paper at COLD 2015
Uniqueness, Density, and Keyness: Exploring Class Hierarchies
Abstract. The Web of Data contains a large number of openly available datasets covering a wide variety of topics. In order to benefit from this massive amount of open data, e.g., to add value to an organization's internal data, such external datasets must be analyzed and understood at the basic level of data types, uniqueness, constraints, value patterns, etc. For Linked Datasets and other Web data, such meta information is currently quite limited or not available at all. Data profiling techniques are needed to compute the respective statistics and meta information. Analyzing datasets along the vocabulary-defined taxonomic hierarchies yields further insights, such as the data distribution at different hierarchy levels, or possible mappings between vocabularies or datasets. In particular, key candidates for entities are difficult to find in light of the sparsity of property values on the Web of Data. To this end, we introduce the concept of keyness and perform a comprehensive analysis of its expressiveness on multiple datasets.
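The abstract does not define the three measures in its title, so the following sketch rests on assumed definitions, chosen only for illustration: uniqueness as the share of distinct values among the values present for a property, density as the share of entities that have a value for that property, and keyness as the harmonic mean of the two. The toy entities and all function names are hypothetical, and each property is assumed to be single-valued.

```python
# Hypothetical toy entities of one class; each dict maps property -> single value.
BOOKS = [
    {"label": "Dune", "isbn": "111"},
    {"label": "Emma", "isbn": "222"},
    {"label": "Dune"},
    {"label": "Ulysses"},
]


def uniqueness(entities, prop):
    """Assumed definition: distinct values divided by values present for prop."""
    values = [entity[prop] for entity in entities if prop in entity]
    return len(set(values)) / len(values) if values else 0.0


def density(entities, prop):
    """Assumed definition: fraction of entities that have a value for prop."""
    if not entities:
        return 0.0
    return sum(1 for entity in entities if prop in entity) / len(entities)


def keyness(entities, prop):
    """Assumed definition: harmonic mean of uniqueness and density."""
    u = uniqueness(entities, prop)
    d = density(entities, prop)
    return 2 * u * d / (u + d) if (u + d) > 0 else 0.0


if __name__ == "__main__":
    for prop in ("label", "isbn"):
        print(prop, uniqueness(BOOKS, prop), density(BOOKS, prop), keyness(BOOKS, prop))
```

Under these assumptions, a property that is perfectly unique but sparsely populated (here `isbn`) scores lower than its uniqueness alone would suggest, which illustrates why sparsity makes key candidates hard to find.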