Exploring Linked Data Graph Structures
The true value of Linked Data becomes apparent when datasets are analyzed and understood, starting at the basic level of data types, constraints, and value patterns. Such data profiling is especially challenging for RDF data, the underlying data model of the Web of Data. In particular, graph analysis can be used to gain more insight into the data, induce schemas, or build indices. Graph patterns are of interest to many communities, e.g., for protein structures, network traffic, crime detection, and modeling object-oriented data.
We extended our RDF profiling framework ProLod++ to allow exploring the graph structures of Linked Datasets by visualizing their connected components and the frequent graph patterns mined from them. Given the underlying graph of a Linked Dataset, which contains all entities as nodes and the object properties between them as links, we detect graph patterns in both its directed and its undirected version. Larger graph components are mined for subgraph patterns using three different approaches: gSpan, GRAMI, and a new approach that searches for predefined patterns. Our goal is to define a set of graph patterns that can be considered the core of most Linked Datasets. We identify graph patterns such as paths, cycles, stars, siamese stars, antennas, caterpillars, and lobsters.
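The pipeline described above (build the entity graph, split it into connected components, then look for predefined shapes) can be illustrated with a minimal sketch. It is not the ProLod++ implementation: the function names, the toy triples, and the degree-based tests are assumptions made for illustration, and only the undirected version and three of the simplest patterns (path, cycle, star) are covered.

```python
from collections import defaultdict


def build_undirected(triples):
    """Adjacency sets of the undirected entity graph; predicates are ignored."""
    adj = defaultdict(set)
    for subj, _pred, obj in triples:
        if subj == obj:
            continue  # assumption: self-loops are dropped
        adj[subj].add(obj)
        adj[obj].add(subj)
    return adj


def connected_components(adj):
    """Return the connected components as a list of node sets (iterative DFS)."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            comp.add(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(comp)
    return components


def classify(comp, adj):
    """Label a whole component as 'cycle', 'star', 'path', or 'other'."""
    n = len(comp)
    degrees = sorted(len(adj[v]) for v in comp)
    edges = sum(degrees) // 2
    if n >= 3 and edges == n and all(d == 2 for d in degrees):
        return "cycle"
    if edges == n - 1:  # a connected graph with n - 1 edges is a tree
        if n >= 4 and degrees[-1] == n - 1:
            return "star"  # one centre linked to every other node
        if degrees[-1] <= 2:
            return "path"  # a tree without branching
    return "other"


# Made-up triples; entity and property names are purely illustrative.
triples = [
    ("a", "p", "b"), ("b", "p", "c"),
    ("x", "discovered", "y1"), ("x", "discovered", "y2"), ("x", "discovered", "y3"),
    ("u", "p", "v"), ("v", "p", "w"), ("w", "p", "u"),
]
adj = build_undirected(triples)
labels = sorted(classify(comp, adj) for comp in connected_components(adj))
print(labels)
```

Classifying by degree sequence only works because each of these three shapes is fully determined by it within a connected component; richer patterns such as caterpillars or lobsters would need structural checks beyond this sketch.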
Isomorphic patterns are grouped, first by their underlying structure and then by class membership (color). This allows for finding not only common, recurring patterns but also patterns that are dominant for certain class combinations. For example, astronomers in DBpedia often appear in star patterns, surrounded by the astronomical objects they discovered. Based on the profiled graph features and patterns, an overall model for Linked Datasets can be derived.
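The two-level grouping can be sketched as follows, assuming small undirected patterns and string class labels. The brute-force canonical form, the helper names, and the sample patterns are hypothetical and chosen only to keep the example self-contained; a real system would use a dedicated isomorphism algorithm, since trying all node orderings is feasible only for a handful of nodes.

```python
from collections import defaultdict
from itertools import permutations


def canonical(nodes, edges, colors=None):
    """Smallest (edge list, color list) over all node renumberings.

    Two patterns get the same key exactly when they are isomorphic
    (respecting colors, if colors are given).
    """
    nodes = list(nodes)
    best = None
    for perm in permutations(range(len(nodes))):
        index = dict(zip(nodes, perm))
        edge_key = tuple(sorted(tuple(sorted((index[u], index[v]))) for u, v in edges))
        if colors is None:
            color_key = ()
        else:
            by_index = sorted((index[n], colors[n]) for n in nodes)
            color_key = tuple(color for _, color in by_index)
        key = (edge_key, color_key)
        if best is None or key < best:
            best = key
    return best


def group_patterns(patterns):
    """Group pattern names by structure first, then by colored structure."""
    groups = defaultdict(lambda: defaultdict(list))
    for name, (nodes, edges, colors) in patterns.items():
        structure = canonical(nodes, edges)
        colored = canonical(nodes, edges, colors)
        groups[structure][colored].append(name)
    return groups


def star(centre, leaves, centre_class, leaf_class):
    """Helper that builds a star pattern with uniformly colored leaves."""
    nodes = [centre] + leaves
    edges = [(centre, leaf) for leaf in leaves]
    colors = {centre: centre_class, **{leaf: leaf_class for leaf in leaves}}
    return nodes, edges, colors


# Made-up patterns; the class names are illustrative only.
patterns = {
    "s1": star("a", ["b", "c", "d"], "Astronomer", "Planet"),
    "s2": star("k", ["l", "m", "n"], "Astronomer", "Planet"),
    "s3": star("q", ["r", "s", "t"], "Band", "Album"),
    "p1": (
        ["w", "x", "y", "z"],
        [("w", "x"), ("x", "y"), ("y", "z")],
        {"w": "Person", "x": "Person", "y": "Person", "z": "Person"},
    ),
}
groups = group_patterns(patterns)
star_key = canonical(*patterns["s1"][:2])
for colored_key, names in groups[star_key].items():
    print(colored_key[1], names)
```

Here the three stars share one structural group, which then splits into an astronomer/planet group containing `s1` and `s2` and a band/album group containing `s3`; the path `p1` forms its own structural group.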