Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Repeatability - cINDs on RDF

This is a repeatability page for CIND discovery algorithms on RDF data. The algorithms are provided in the state in which their results were published; they may not represent the most recent version of their implementations.

Algorithms

Conditional inclusion dependencies (CINDs) within RDF datasets are a valuable input to core data management tasks, such as query optimization, ontology reverse engineering, and knowledge discovery. Most CIND discovery algorithms focus on relational databases, where they generate conditions for the left-hand side of a partial IND. In RDF datasets, it is of particular importance to consider both left-hand side and right-hand side conditions. Moreover, RDF datasets are fundamentally different from relational databases w.r.t. their structure. Therefore, RDF CIND discovery algorithms have to be designed differently from their relational counterparts.
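To make the notion of left-hand side and right-hand side conditions concrete, the following minimal Python sketch checks a single CIND on a handful of triples. It is an illustration only and not the RDFind implementation. It assumes that a CIND side consists of a projection attribute (s, p, or o) plus equality conditions on the other attributes; the helper names capture and cind_holds and the example data are hypothetical.

```python
# Position of each triple attribute: subject, predicate, object.
ATTR = {"s": 0, "p": 1, "o": 2}


def capture(triples, project, condition):
    """Return the distinct values of attribute `project` over all triples
    that satisfy every equality in `condition` (a dict attribute -> value)."""
    return {
        t[ATTR[project]]
        for t in triples
        if all(t[ATTR[a]] == v for a, v in condition.items())
    }


def cind_holds(triples, lhs, rhs):
    """Check whether the values selected by `lhs` are included in those
    selected by `rhs`. Each side is a pair (projection attribute, condition)."""
    return capture(triples, *lhs) <= capture(triples, *rhs)


# Hypothetical example data.
TRIPLES = [
    ("patrick", "rdf:type", "GradStudent"),
    ("mike", "rdf:type", "GradStudent"),
    ("john", "rdf:type", "Professor"),
    ("patrick", "undergradDegreeFrom", "hpi"),
    ("mike", "undergradDegreeFrom", "cmu"),
    ("tim", "undergradDegreeFrom", "hpi"),
]

# "Every graduate student has an undergraduate degree."
GRAD_STUDENTS = ("s", {"p": "rdf:type", "o": "GradStudent"})
DEGREE_HOLDERS = ("s", {"p": "undergradDegreeFrom"})

print(cind_holds(TRIPLES, GRAD_STUDENTS, DEGREE_HOLDERS))
print(cind_holds(TRIPLES, DEGREE_HOLDERS, GRAD_STUDENTS))
```

In this example the inclusion holds only in one direction: both graduate students have a degree, but tim has a degree without being listed as a graduate student. Note that conditions appear on both sides of the inclusion, which is the RDF-specific aspect described above.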

Algorithm   Link
RDFind      GitHub

Datasets

The table below lists the datasets (including links¹) that were used in the evaluation of the above algorithm.

Name         Size of NT format   Number of (distinct) triples
Countries    795 KB              5,563
Diseasome    13 MB               72,445
LUBM-1²      17 MB               103,104
DrugBank     102 MB              517,023
LinkedMDB    870 MB              6,148,121
DB14-MPCE    4.33 GB             33,329,233
DB14-PLE     21.77 GB            152,913,360
Freebase     398.1 GB            3,000,673,968

 

1 Please note that the datasets might be subject to change. The above table reflects the state of the datasets when we downloaded them in mid-2015.

2 Not a real-world dataset (generated).
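Because the datasets may have changed since they were downloaded, it can be useful to recount the distinct triples of a local copy and compare the result with the table above. The following Python sketch shows one simple way to do so. It is not the tooling used for the evaluation, and it assumes that duplicate triples are serialized identically, so that comparing stripped lines is sufficient. The function name count_distinct_triples and the sample data are hypothetical.

```python
def count_distinct_triples(lines):
    """Count distinct N-Triples statements in an iterable of lines.

    Blank lines and comment lines (starting with '#') are skipped.
    Assumption: two equal triples are written identically, so comparing
    the stripped lines is enough to detect duplicates.
    """
    distinct = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        distinct.add(line)
    return len(distinct)


# Hypothetical sample with one duplicate, one comment, and one blank line.
SAMPLE = [
    "# sample data",
    "<http://example.org/a> <http://example.org/p> <http://example.org/b> .",
    "<http://example.org/a> <http://example.org/p> <http://example.org/b> .",
    "",
    '<http://example.org/a> <http://example.org/name> "A" .',
]

print(count_distinct_triples(SAMPLE))
```

The function accepts any iterable of lines, so an open file object can be passed directly. The set is held in memory, which should be fine for the smaller datasets but is unlikely to work for the largest ones; those would need an external sort or a similar out-of-core approach.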

Algorithmic Results

RDFind

In general, even small RDF datasets tend to contain an intractable number of CINDs, most of which do not provide any value for applications such as query optimization. For this reason, RDFind extracts only pertinent CINDs, that is, CINDs that (i) comprise sufficiently many entities (= CIND support) and (ii) are not implied by any other pertinent CIND (= minimal cover). Furthermore, RDFind distinguishes a special class of CINDs (= association rules/ARs) that state "if a triple has value v1 in attribute a1, then it has value v2 in attribute a2". The following numbers reflect these peculiarities of the RDFind algorithm. Find the details in our SIGMOD 2016 paper.
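The association rule pattern quoted above can be illustrated with a brute-force Python sketch. It is not the RDFind algorithm, which is designed for far larger inputs. As an assumption made only for this example, the support of a rule is taken to be the number of distinct triples that carry value v1 in attribute a1. The function name find_association_rules and the example data are hypothetical.

```python
from collections import defaultdict
from itertools import permutations

ATTRIBUTES = ("s", "p", "o")


def find_association_rules(triples, min_support):
    """Return all rules (a1, v1, a2, v2) such that every triple with value v1
    in attribute a1 has value v2 in attribute a2.

    Assumption for this sketch: the support of a rule is the number of
    distinct triples with a1 == v1, and it must reach `min_support`.
    """
    distinct = set(triples)
    rules = []
    for i, j in permutations(range(3), 2):
        co_values = defaultdict(set)  # v1 -> values seen in attribute j
        support = defaultdict(int)    # v1 -> number of distinct triples
        for t in distinct:
            co_values[t[i]].add(t[j])
            support[t[i]] += 1
        for v1, values in co_values.items():
            if len(values) == 1 and support[v1] >= min_support:
                v2 = next(iter(values))
                rules.append((ATTRIBUTES[i], v1, ATTRIBUTES[j], v2))
    return sorted(rules)


# Hypothetical example data.
TRIPLES = [
    ("alice", "rdf:type", "Student"),
    ("bob", "rdf:type", "Student"),
    ("carol", "rdf:type", "Professor"),
    ("alice", "memberOf", "dept1"),
    ("bob", "memberOf", "dept1"),
]

for rule in find_association_rules(TRIPLES, min_support=2):
    print("if %s = %s then %s = %s" % rule)
```

With a support threshold of 2, this sketch reports three rules on the example data, for instance "if o = Student then p = rdf:type". Raising the threshold to 3 leaves no rule, which mirrors how the numbers in the tables below shrink as the support threshold grows.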

 

RDFind on Countries
Support threshold   Runtime [s]   #CINDs       #ARs
1                   55            2,104,525    4,363
10                  18            35,269       33
25                  17            491          28
50                  17            272          17
100                 18            142          12
500                 17            0            0

RDFind on Diseasome
Support threshold   Runtime [s]   #CINDs       #ARs
10                  22            12,541       940
25                  20            1,220        93
50                  21            288          22
100                 20            67           15
500                 19            15           3
1,000               19            6            2
10,000              18            0            0

RDFind on LUBM-1
Support threshold   Runtime [s]   #CINDs       #ARs
10                  26            10,429       701
25                  22            1,745        34
50                  23            119          33
100                 22            95           31
500                 19            26           11
1,000               21            7            6
10,000              19            0            0

RDFind on DrugBank
Support threshold   Runtime [s]   #CINDs       #ARs
10                  76            242,970      2,190
25                  54            6,980        761
50                  36            2,342        322
100                 30            984          156
500                 26            180          32
1,000               27            129          21
10,000              24            1            1

RDFind on LinkedMDB
Support threshold   Runtime [s]   #CINDs       #ARs
10                  1,800         37,887,079   8,328
25                  167           33,372       2,576
50                  142           6,955        874
100                 126           2,659        344
500                 112           699          91
1,000               126           344          45
10,000              114           155          26

RDFind on DB14-MPCE
Support threshold   Runtime [s]   #CINDs       #ARs
25                  885           120,721      18,558
50                  468           37,689       7,815
100                 402           12,215       3,599
500                 309           1,014        606
1,000               317           375          259
10,000              224           26           20

RDFind on DB14-PLE
Support threshold   Runtime [s]   #CINDs       #ARs
50                  35,304        62,134       1,078,785
100                 10,742        18,460       406,301
500                 1,393         804          37,246
1,000               1,070         167          13,799
10,000              785           1            514