Abstract. Linked and other Open Data poses new challenges and opportunities for the data mining community. Unfortunately, the large volume and great heterogeneity of available open data require significant integration steps before the data can be used in applications. A promising technique to explore such data is association rule mining. We introduce two algorithms for enriching RDF data. The first is a suggestion engine that is based on mining RDF predicates and supports manual statement creation by suggesting new predicates for a given entity. The second is knowledge creation: based on mining both predicates and objects, we are able to generate entirely new statements for a given data set without any external resources.
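To illustrate the idea behind the suggestion engine, the following is a minimal sketch, not the authors' implementation. It assumes that each subject's set of predicates is treated as one transaction and that only pairwise rules of the form {p} -> q are mined. The function names, the thresholds, and the sample triples are hypothetical.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical sample data: (subject, predicate, object) triples.
TRIPLES = [
    ("alice", "name", "Alice"), ("alice", "birthDate", "1980"), ("alice", "birthPlace", "Berlin"),
    ("bob", "name", "Bob"), ("bob", "birthDate", "1975"), ("bob", "birthPlace", "Potsdam"),
    ("carol", "name", "Carol"), ("carol", "birthDate", "1990"),
    ("berlin", "name", "Berlin"), ("berlin", "population", "3500000"),
]

def predicate_sets(triples):
    """Group triples by subject; each subject's predicate set is one transaction."""
    by_subject = defaultdict(set)
    for s, p, _ in triples:
        by_subject[s].add(p)
    return by_subject

def mine_rules(transactions, min_support=2, min_confidence=0.6):
    """Mine pairwise rules {a} -> b and return {(a, b): confidence}."""
    single = defaultdict(int)  # number of transactions containing a
    pair = defaultdict(int)    # number of transactions containing both a and b
    for preds in transactions:
        for p in preds:
            single[p] += 1
        for a, b in permutations(sorted(preds), 2):
            pair[(a, b)] += 1
    rules = {}
    for (a, b), n in pair.items():
        confidence = n / single[a]
        if n >= min_support and confidence >= min_confidence:
            rules[(a, b)] = confidence
    return rules

def suggest(entity_preds, rules):
    """Rank predicates the entity lacks by the best confidence of a matching rule."""
    scores = {}
    for (a, b), confidence in rules.items():
        if a in entity_preds and b not in entity_preds:
            scores[b] = max(scores.get(b, 0.0), confidence)
    return sorted(scores, key=lambda q: (-scores[q], q))
```

On the sample data, calling `suggest` with the predicates of `carol` returns `['birthPlace']`, because two of the three entities that have a `birthDate` also have a `birthPlace`. Generating whole new statements would additionally require mining over objects, which this sketch omits.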
Paper at PROFILES2014
LODOP - Multi-Query Optimization for Linked Data Profiling Queries
Benedikt Forchhammer, Anja Jentzsch, Felix Naumann
Abstract. The Web of Data contains a large number of different, openly available datasets. In order to effectively integrate them into existing applications, meta information on statistical and structural properties is needed. Examples include information about cardinalities, value patterns, or co-occurring properties. For Linked Data sets such information is currently very limited or not available at all. Data profiling techniques are needed to compute the respective statistics and meta information. However, current state-of-the-art approaches either cannot be applied to Linked Data or exhibit considerable performance problems.
We present Lodop, a framework for computing, optimizing, and benchmarking data profiling techniques based on MapReduce with Apache Pig. We implemented 15 of the most important data profiling tasks, optimized their simultaneous execution, and evaluated them on four typical datasets from the Web of Data. Our optimizations focus on reducing the number of MapReduce jobs and minimizing the communication overhead between multiple jobs. Our evaluation shows significant potential for optimizing the runtime costs of Linked Data profiling.
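The benefit of executing several profiling tasks together can be illustrated with a small sketch. It is not Lodop code: Lodop operates on Apache Pig scripts and MapReduce jobs, whereas this example only shows the underlying intuition of sharing one scan among several tasks, in plain Python. The three profiling tasks, the `CountingSource` helper, and all other names are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

class CountingSource:
    """Hypothetical wrapper that counts how often the triples are fully scanned."""
    def __init__(self, triples):
        self._triples = list(triples)
        self.scans = 0

    def __iter__(self):
        self.scans += 1
        return iter(self._triples)

def profile_separately(source):
    """Run three profiling tasks independently; each one scans the data again."""
    usage = Counter(p for _, p, _ in source)
    subjects = {s for s, _, _ in source}
    objects = defaultdict(set)
    for _, p, o in source:
        objects[p].add(o)
    return {
        "property_usage": dict(usage),
        "distinct_subjects": len(subjects),
        "distinct_objects_per_property": {p: len(v) for p, v in objects.items()},
    }

def profile_shared(source):
    """Compute the same three statistics in a single shared scan."""
    usage = Counter()
    subjects = set()
    objects = defaultdict(set)
    for s, p, o in source:
        usage[p] += 1
        subjects.add(s)
        objects[p].add(o)
    return {
        "property_usage": dict(usage),
        "distinct_subjects": len(subjects),
        "distinct_objects_per_property": {p: len(v) for p, v in objects.items()},
    }
```

Both functions return identical statistics, but the separate variant reads the input three times and the shared variant only once. In a MapReduce setting the analogous saving would come from merging jobs that read the same input, which is the kind of reduction in jobs and inter-job communication the abstract refers to.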