CIND Discovery on RDF Data
Repeatability - CINDs on RDF
This is a repeatability page for CIND discovery algorithms on RDF data. The algorithms are provided in the state in which their results were published, so they may not reflect the most recent version of their implementations.
Algorithms
Conditional inclusion dependencies (CINDs) within RDF datasets are a valuable input to core data management tasks, such as query optimization, ontology reverse engineering, and knowledge discovery. Most CIND discovery algorithms focus on relational databases, where they generate conditions for the left-hand side of a partial IND. In RDF datasets, it is particularly important to consider conditions on both the left-hand side and the right-hand side. Moreover, RDF datasets are fundamentally different from relational databases with respect to their structure. Therefore, RDF CIND discovery algorithms have to be designed differently from their relational counterparts.
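To make the notion concrete, the following minimal sketch checks a single CIND on a toy set of triples. It assumes a simplified model that is not spelled out on this page: each side of a CIND projects one triple attribute (`s`, `p`, or `o`) under a condition that fixes the values of other attributes, and the CIND holds when every value selected on the left-hand side also appears on the right-hand side. The helper names `select` and `cind_holds` and the sample triples are hypothetical and are not taken from any published implementation.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
ATTRS = {"s": 0, "p": 1, "o": 2}  # position of each attribute in a triple


def select(triples: List[Triple], project: str, condition: Dict[str, str]) -> Set[str]:
    """Values of attribute `project` over all triples that satisfy `condition`."""
    return {
        t[ATTRS[project]]
        for t in triples
        if all(t[ATTRS[attr]] == value for attr, value in condition.items())
    }


def cind_holds(triples: List[Triple], lhs, rhs) -> bool:
    """True if the values selected by `lhs` are included in those selected by `rhs`."""
    return select(triples, *lhs) <= select(triples, *rhs)


# Hypothetical toy data, for illustration only.
triples = [
    ("patrick", "rdf:type", "gradStudent"),
    ("mike", "rdf:type", "gradStudent"),
    ("patrick", "undergradFrom", "hpi"),
    ("mike", "undergradFrom", "cmu"),
    ("tim", "undergradFrom", "hpi"),
]

# "Every graduate student has an undergraduate degree from somewhere."
lhs = ("s", {"p": "rdf:type", "o": "gradStudent"})  # conditions on p and o
rhs = ("s", {"p": "undergradFrom"})                 # condition on p only

print(cind_holds(triples, lhs, rhs))  # True
print(cind_holds(triples, rhs, lhs))  # False: "tim" is not a gradStudent
```

Note that both sides carry a condition here, which is the case the paragraph above singles out as important for RDF data.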
Datasets
The table below lists the datasets (including links¹) that have been used in the evaluation of the above algorithm.
| Name | Size of NT format | Number of (distinct) triples |
|---|---|---|
| Countries | 795 KB | 5,563 |
| Diseasome | 13 MB | 72,445 |
| LUBM-1² | 17 MB | 103,104 |
| DrugBank | 102 MB | 517,023 |
| LinkedMDB | 870 MB | 6,148,121 |
| DB14-MPCE | 4.33 GB | 33,329,233 |
| DB14-PLE | 21.77 GB | 152,913,360 |
| Freebase | 398.1 GB | 3,000,673,968 |
¹ Please note that datasets might be subject to change. The above table reflects the state of the datasets when we downloaded them in mid-2015.
² Not a real-world dataset (generated).
Algorithmic Results
RDFind
In general, even small RDF datasets tend to contain an intractable number of CINDs, most of which provide no value for applications such as query optimization. For this reason, RDFind extracts only pertinent CINDs, that is, CINDs that (i) comprise sufficiently many entities (= CIND support) and (ii) are not implied by any other pertinent CIND (= minimal cover). Furthermore, RDFind distinguishes a special class of CINDs, association rules (ARs), which state "if a triple has value v1 in attribute a1, then it has value v2 in attribute a2". The following numbers reflect these peculiarities of the RDFind algorithm. Details can be found in our SIGMOD 2016 paper.
| Support threshold | Runtime [s] | #CINDs | #ARs |
|---|---|---|---|
| 1 | 55 | 2,104,525 | 4,363 |
| 10 | 18 | 35,269 | 33 |
| 25 | 17 | 491 | 28 |
| 50 | 17 | 272 | 17 |
| 100 | 18 | 142 | 12 |
| 500 | 17 | 0 | 0 |
| Support threshold | Runtime [s] | #CINDs | #ARs |
|---|---|---|---|
| 10 | 22 | 12,541 | 940 |
| 25 | 20 | 1,220 | 93 |
| 50 | 21 | 288 | 22 |
| 100 | 20 | 67 | 15 |
| 500 | 19 | 15 | 3 |
| 1,000 | 19 | 6 | 2 |
| 10,000 | 18 | 0 | 0 |
| Support threshold | Runtime [s] | #CINDs | #ARs |
|---|---|---|---|
| 10 | 26 | 10,429 | 701 |
| 25 | 22 | 1,745 | 34 |
| 50 | 23 | 119 | 33 |
| 100 | 22 | 95 | 31 |
| 500 | 19 | 26 | 11 |
| 1,000 | 21 | 7 | 6 |
| 10,000 | 19 | 0 | 0 |
| Support threshold | Runtime [s] | #CINDs | #ARs |
|---|---|---|---|
| 10 | 76 | 242,970 | 2,190 |
| 25 | 54 | 6,980 | 761 |
| 50 | 36 | 2,342 | 322 |
| 100 | 30 | 984 | 156 |
| 500 | 26 | 180 | 32 |
| 1,000 | 27 | 129 | 21 |
| 10,000 | 24 | 1 | 1 |
| Support threshold | Runtime [s] | #CINDs | #ARs |
|---|---|---|---|
| 10 | 1,800 | 37,887,079 | 8,328 |
| 25 | 167 | 33,372 | 2,576 |
| 50 | 142 | 6,955 | 874 |
| 100 | 126 | 2,659 | 344 |
| 500 | 112 | 699 | 91 |
| 1,000 | 126 | 344 | 45 |
| 10,000 | 114 | 155 | 26 |
| Support threshold | Runtime [s] | #CINDs | #ARs |
|---|---|---|---|
| 25 | 885 | 120,721 | 18,558 |
| 50 | 468 | 37,689 | 7,815 |
| 100 | 402 | 12,215 | 3,599 |
| 500 | 309 | 1,014 | 606 |
| 1,000 | 317 | 375 | 259 |
| 10,000 | 224 | 26 | 20 |
| Support threshold | Runtime [s] | #CINDs | #ARs |
|---|---|---|---|
| 50 | 35,304 | 62,134 | 1,078,785 |
| 100 | 10,742 | 18,460 | 406,301 |
| 500 | 1,393 | 804 | 37,246 |
| 1,000 | 1,070 | 167 | 13,799 |
| 10,000 | 785 | 1 | 514 |
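The interplay between association rules and the support threshold reported in the tables above can be illustrated with a small sketch. It is not RDFind itself: it enumerates, by brute force, rules of the form "if a triple has value v1 in attribute a1, then it has value v2 in attribute a2" and keeps those that meet a minimum support. As a simplifying assumption, support is counted here as the number of distinct triples matching the condition, which may differ from the exact definition in the paper. The function name `association_rules` and the sample triples are hypothetical.

```python
from collections import defaultdict
from itertools import permutations
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
Rule = Tuple[str, str, str, str, int]  # (a1, v1, a2, v2, support)
ATTRS = ("s", "p", "o")


def association_rules(triples: List[Triple], min_support: int) -> List[Rule]:
    """Brute-force search for rules 'a1 == v1 implies a2 == v2' with enough support."""
    distinct = set(triples)  # duplicates must not inflate the support
    rules: List[Rule] = []
    for i, j in permutations(range(3), 2):  # all ordered attribute pairs
        co_values: Dict[str, Set[str]] = defaultdict(set)  # v1 -> values seen in a2
        support: Dict[str, int] = defaultdict(int)         # v1 -> matching triples
        for t in distinct:
            co_values[t[i]].add(t[j])
            support[t[i]] += 1
        for v1, v2s in co_values.items():
            # A rule holds only if v1 always co-occurs with one single v2.
            if len(v2s) == 1 and support[v1] >= min_support:
                (v2,) = v2s
                rules.append((ATTRS[i], v1, ATTRS[j], v2, support[v1]))
    return sorted(rules)


# Hypothetical toy data, for illustration only.
triples = [
    ("a", "rdf:type", "Person"),
    ("b", "rdf:type", "Person"),
    ("c", "rdf:type", "Person"),
    ("a", "knows", "b"),
    ("b", "knows", "c"),
]

for rule in association_rules(triples, min_support=3):
    print(rule)
# ('o', 'Person', 'p', 'rdf:type', 3)
# ('p', 'rdf:type', 'o', 'Person', 3)
```

Raising `min_support` to 4 leaves no rules on this toy input, which mirrors the trend in the tables: higher support thresholds yield fewer CINDs and ARs.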