CIND Discovery on RDF Data

Repeatability - CINDs on RDF

This is a repeatability page for CIND discovery algorithms on RDF data. The algorithms are provided in the state in which their results were published, so they may not represent the most recent version of their implementations.

Algorithms

Conditional inclusion dependencies (CINDs) within RDF datasets are a valuable input to core data management tasks, such as query optimization, ontology reverse engineering, and knowledge discovery. Most CIND discovery algorithms focus on relational databases, where they generate conditions for the left-hand side of a partial IND. In RDF datasets, it is of particular importance to consider both left-hand side and right-hand side conditions. Moreover, RDF datasets are fundamentally different from relational databases w.r.t. their structure. Therefore, RDF CIND discovery algorithms have to be designed differently from their relational counterparts.
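To make the notion of left-hand side and right-hand side conditions concrete, the following minimal sketch checks a single CIND over a handful of triples. It assumes that each side of a CIND can be modeled as a projection attribute (s, p, or o) together with equality conditions on the other attributes. The names `select` and `cind_holds` and the example triples are hypothetical and not taken from the published implementation.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
Side = Tuple[str, Dict[str, str]]  # (projection attribute, equality conditions)

# Position of each RDF attribute within a triple.
FIELDS = {"s": 0, "p": 1, "o": 2}


def select(triples: List[Triple], project: str, condition: Dict[str, str]) -> Set[str]:
    """Collect the `project` values of all triples that satisfy every condition."""
    return {
        t[FIELDS[project]]
        for t in triples
        if all(t[FIELDS[attr]] == value for attr, value in condition.items())
    }


def cind_holds(triples: List[Triple], lhs: Side, rhs: Side) -> bool:
    """A CIND lhs ⊆ rhs holds if every conditioned lhs value also occurs on the rhs."""
    return select(triples, *lhs) <= select(triples, *rhs)


# Hypothetical toy data, for illustration only.
triples = [
    ("alice", "rdf:type", "GradStudent"),
    ("alice", "undergradFrom", "mit"),
    ("bob", "rdf:type", "GradStudent"),
    ("bob", "undergradFrom", "hpi"),
    ("carol", "rdf:type", "Professor"),
    ("carol", "undergradFrom", "tu"),
]

# "Every subject typed as GradStudent is also the subject of an undergradFrom triple."
lhs = ("s", {"p": "rdf:type", "o": "GradStudent"})
rhs = ("s", {"p": "undergradFrom"})

print(cind_holds(triples, lhs, rhs))  # True
print(cind_holds(triples, rhs, lhs))  # False: carol is not a GradStudent
```

Both sides carry conditions here, which is the case the paragraph above singles out as important for RDF. A discovery algorithm has to search the space of such condition pairs; this sketch only validates one given candidate.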

GitHub

Datasets

The table below lists the datasets (including links¹) that have been used in the evaluation of the above algorithm.

Name	Size of NT format	Number of (distinct) triples
Countries	795 KB	5,563
Diseasome	13 MB	72,445
LUBM-1²	17 MB	103,104
DrugBank	102 MB	517,023
LinkedMDB	870 MB	6,148,121
DB14-MPCE	4.33 GB	33,329,233
DB14-PLE	21.77 GB	152,913,360
Freebase	398.1 GB	3,000,673,968

¹ Please note that the datasets might be subject to change. The table above reflects the state of the datasets when we downloaded them in mid-2015.

² Not a real-world dataset (generated).

Algorithmic Results

RDFind

In general, even small RDF datasets tend to contain an intractable number of CINDs, most of which do not provide any value for applications such as query optimization. For this reason, RDFind extracts only pertinent CINDs, that is, CINDs that (i) comprise sufficiently many entities (= CIND support) and (ii) are not implied by any other pertinent CIND (= minimal cover). Furthermore, RDFind distinguishes a special class of CINDs (= association rules/ARs) that state "if a triple has value v₁ in attribute a₁, then it has value v₂ in attribute a₂". The following numbers reflect these peculiarities of the RDFind algorithm. Details can be found in our SIGMOD 2016 paper.
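The association rule pattern quoted above can be illustrated with a small brute-force sketch. It is not RDFind itself: it assumes that the support of a rule is the number of distinct triples matching its antecedent (which may differ from the exact definition in the paper), it does not compute a minimal cover, and the function name `association_rules` and the toy triples are hypothetical.

```python
from collections import Counter
from itertools import permutations
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]
Item = Tuple[str, str]  # (attribute, value), e.g. ("p", "rdf:type")


def association_rules(triples: Iterable[Triple], min_support: int) -> List[Tuple[Item, Item, int]]:
    """Return all rules (a1 = v1) -> (a2 = v2) that hold exactly with enough support."""
    single: Counter = Counter()  # item -> number of distinct triples containing it
    pair: Counter = Counter()    # (antecedent, consequent) -> co-occurrence count
    for triple in set(triples):  # count each distinct triple once
        items = list(zip("spo", triple))
        single.update(items)
        pair.update(permutations(items, 2))
    # A rule holds if every triple with the antecedent also has the consequent.
    return sorted(
        (antecedent, consequent, count)
        for (antecedent, consequent), count in pair.items()
        if count == single[antecedent] and count >= min_support
    )


# Hypothetical toy data, for illustration only.
triples = [
    ("alice", "rdf:type", "Student"),
    ("bob", "rdf:type", "Student"),
    ("carol", "rdf:type", "Student"),
    ("dave", "likes", "Student"),
]

for antecedent, consequent, support in association_rules(triples, min_support=3):
    print(antecedent, "->", consequent, "support", support)
# ('p', 'rdf:type') -> ('o', 'Student') support 3
```

The reverse rule does not hold in this toy data because the triple about dave has the object Student without the predicate rdf:type. Raising `min_support` shrinks the output, which mirrors how the counts in the result tables fall as the support threshold grows.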

RDFind on Countries
Support threshold	Runtime [s]	#CINDs	#ARs
1	55	2,104,525	4,363
10	18	35,269	33
25	17	491	28
50	17	272	17
100	18	142	12
500	17	0	0

RDFind on Diseasome
Support threshold	Runtime [s]	#CINDs	#ARs
10	22	12,541	940
25	20	1,220	93
50	21	288	22
100	20	67	15
500	19	15	3
1,000	19	6	2
10,000	18	0	0

RDFind on LUBM-1
Support threshold	Runtime [s]	#CINDs	#ARs
10	26	10,429	701
25	22	1,745	34
50	23	119	33
100	22	95	31
500	19	26	11
1,000	21	7	6
10,000	19	0	0

RDFind on DrugBank
Support threshold	Runtime [s]	#CINDs	#ARs
10	76	242,970	2,190
25	54	6,980	761
50	36	2,342	322
100	30	984	156
500	26	180	32
1,000	27	129	21
10,000	24	1	1

RDFind on LinkedMDB
Support threshold	Runtime [s]	#CINDs	#ARs
10	1,800	37,887,079	8,328
25	167	33,372	2,576
50	142	6,955	874
100	126	2,659	344
500	112	699	91
1,000	126	344	45
10,000	114	155	26

RDFind on DB14-MPCE
Support threshold	Runtime [s]	#CINDs	#ARs
25	885	120,721	18,558
50	468	37,689	7,815
100	402	12,215	3,599
500	309	1,014	606
1,000	317	375	259
10,000	224	26	20

RDFind on DB14-PLE
Support threshold	Runtime [s]	#CINDs	#ARs
50	35,304	62,134	1,078,785
100	10,742	18,460	406,301
500	1,393	804	37,246
1,000	1,070	167	13,799
10,000	785	1	514