Prof. Dr. Felix Naumann

Large-Scale Data Analysis Tasks

Grading formula


where i is the current subtask, t_max,i is the maximum runtime (over all teams) for this subtask, t_i is the individual team's runtime for this subtask, and w_i is the weight of this subtask. If your algorithm solves n subtasks in one run, the total runtime is split evenly into n equal shares, one t_i per subtask. This may result in a longer runtime for one subtask compared to other teams, but it should yield better results for the other subtasks and a higher score overall. If it doesn't, your code should be revised...



Difficulty (the lower the easier)

Count Triples
  • count the number of triples (or rather quadruples, as data is in NQuad format)
easy peasy
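
As a minimal sketch of the counting task, assuming one statement per line as in the NQuad format and that empty lines and `#` comment lines should be skipped, the count reduces to a filtered line count. The function name `count_quads` and the sample data are hypothetical and only serve as illustration.

```python
from typing import Iterable


def count_quads(lines: Iterable[str]) -> int:
    """Count statements, assuming one quadruple per non-empty, non-comment line."""
    count = 0
    for line in lines:
        stripped = line.strip()
        # Skip blank lines and comment lines; everything else counts as one quad.
        if stripped and not stripped.startswith("#"):
            count += 1
    return count


# Hypothetical sample input (not taken from the course data).
sample = [
    '<http://dbpedia.org/Berlin> <http://xmlns.com/foaf/0.1/name> "Berlin" <http://dbpedia.org/Berlin> .',
    "",
    "# a comment line",
    "<http://dbpedia.org/Berlin> <http://www.w3.org/2002/07/owl#sameAs> "
    "<http://example.org/Berlin> <http://dbpedia.org/Berlin> .",
]
```

On a cluster the same logic would be expressed as a map step that emits 1 per valid line and a reduce step that sums, but that part is not shown here.
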
Cluster Datasets
  • identify datasets by URIs, per dataset:
    • identify dataset location (e.g. "http://dbpedia.org/")
    • identify good dataset sample resource (e.g. "http://dbpedia.org/Berlin")
    • identify URI regular expression patterns
    • suggest a textual description

clustering + text extraction + regexp detection 
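
One possible reading of the URI-based clustering, sketched under the assumption that a dataset is identified by the scheme and host of its subject URIs. The helper names (`dataset_location`, `cluster_by_location`, `describe`) are hypothetical, the sample resource is simply the first URI seen, and the generated pattern is a plain prefix match rather than a learned regular expression. Suggesting a textual description is not covered.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit


def dataset_location(uri):
    """Reduce a URI to scheme and host, e.g. 'http://dbpedia.org/'."""
    parts = urlsplit(uri)
    return f"{parts.scheme}://{parts.netloc}/"


def cluster_by_location(subject_uris):
    """Group subject URIs by their dataset location (assumed: scheme + host)."""
    clusters = defaultdict(list)
    for uri in subject_uris:
        clusters[dataset_location(uri)].append(uri)
    return clusters


def describe(clusters):
    """Per dataset: a sample resource, a prefix pattern, and the cluster size."""
    summary = {}
    for location, uris in clusters.items():
        summary[location] = {
            "sample": uris[0],  # assumption: first URI seen is good enough
            "pattern": "^" + re.escape(location) + ".*$",
            "size": len(uris),
        }
    return summary


# Hypothetical subject URIs for illustration.
uris = [
    "http://dbpedia.org/Berlin",
    "http://dbpedia.org/Paris",
    "http://example.org/x",
]
summary = describe(cluster_by_location(uris))
```

A more careful solution would pick the sample resource by frequency and derive tighter patterns from the path structure.
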

Cluster Datasets II
  • identify datasets by different means than solely using the URI, per dataset:
    • identify good dataset sample resource (e.g. "http://dbpedia.org/Berlin")
    • identify URI regular expression patterns
    • suggest a textual description
up to 4, depending on "Wow"-Factor
clustering + text extraction + regexp detection
Identify vocabularies
  • identify RDF namespace of all predicates (e.g. "http://xmlns.com/foaf/0.1")

clustering of namespaces (similar to numberOfDistinctPredicates?) + regexp detection ("http://xmlns.com/foaf/0.1/name" => "http://xmlns.com/foaf/0.1" etc.)
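
The namespace extraction hinted at above can be sketched as follows, assuming the namespace is everything before the last `#` or `/` of the predicate URI (which reproduces the "http://xmlns.com/foaf/0.1/name" => "http://xmlns.com/foaf/0.1" example). The names `namespace_of` and `vocabularies` are hypothetical.

```python
from collections import Counter


def namespace_of(predicate_uri):
    """Cut a predicate URI at its last '#' or '/' (separator dropped)."""
    cut = max(predicate_uri.rfind("#"), predicate_uri.rfind("/"))
    # Do not cut inside the '://' that follows the scheme.
    scheme_end = predicate_uri.find("://") + 2
    return predicate_uri[:cut] if cut > scheme_end else predicate_uri


def vocabularies(predicate_uris):
    """Count how many predicate occurrences fall into each namespace."""
    return Counter(namespace_of(uri) for uri in predicate_uris)


# Hypothetical predicate list for illustration.
predicates = [
    "http://xmlns.com/foaf/0.1/name",
    "http://xmlns.com/foaf/0.1/knows",
    "http://www.w3.org/2002/07/owl#sameAs",
]
counts = vocabularies(predicates)
```
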

Compute RDF statistics
  • identify numberOfDistinctSubjects, numberOfDistinctObjects, numberOfDistinctPredicates, numberOfDistinctContexts, numberOfResources
all similar to WordCount (except for when blank nodes occur)
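
A single-machine sketch of these statistics, under several assumptions: the regular expression below is a simplified approximation of the NQuad syntax and not a full parser, blank node labels are treated as ordinary strings (the special case mentioned above is not handled), and numberOfResources is interpreted as the number of distinct URIs in subject or object position, which the task description does not define. All names are hypothetical.

```python
import re

# Simplified term: <URI>, blank node, or literal with optional language tag / datatype.
TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
QUAD = re.compile(rf'^\s*{TERM}\s+{TERM}\s+{TERM}(?:\s+{TERM})?\s*\.\s*$')


def parse_quad(line):
    """Return (subject, predicate, object, context) or None if the line does not match."""
    match = QUAD.match(line)
    return match.groups() if match else None


def rdf_statistics(lines):
    subjects, predicates, objects, contexts = set(), set(), set(), set()
    for line in lines:
        quad = parse_quad(line)
        if quad is None:
            continue  # skip comments, blank lines, and lines the sketch cannot parse
        s, p, o, c = quad
        subjects.add(s)
        predicates.add(p)
        objects.add(o)
        if c is not None:
            contexts.add(c)
    # Assumption: a "resource" is any URI occurring as subject or object.
    resources = {term for term in subjects | objects if term.startswith("<")}
    return {
        "numberOfDistinctSubjects": len(subjects),
        "numberOfDistinctPredicates": len(predicates),
        "numberOfDistinctObjects": len(objects),
        "numberOfDistinctContexts": len(contexts),
        "numberOfResources": len(resources),
    }


# Hypothetical sample data for illustration.
sample_lines = [
    '<http://a.org/s1> <http://xmlns.com/foaf/0.1/name> "Alice Smith" <http://a.org/s1> .',
    '<http://a.org/s1> <http://xmlns.com/foaf/0.1/knows> <http://b.org/s2> <http://a.org/s1> .',
    '<http://b.org/s2> <http://xmlns.com/foaf/0.1/name> "Bob"@en <http://b.org/s2> .',
    "# comment",
]
stats = rdf_statistics(sample_lines)
```

Holding all distinct terms in memory will not scale to the full data; in a WordCount-style job each set would become a distinct-key count.
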
Detect Linksets
  • identify linksets among datasets
  • count number of links within a linkset
once you have datasets, this is easy (same principle applied to precomputed sets)
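
Once every URI can be mapped to a dataset, linkset detection can be sketched as a count over (source dataset, target dataset) pairs. This assumes, as in the URI-based clustering task, that a dataset is given by scheme and host, and that a link is any statement whose subject and object URIs belong to different datasets. The names `dataset_of` and `detect_linksets` are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def dataset_of(uri):
    """Assumed dataset assignment: scheme and host of the URI."""
    parts = urlsplit(uri)
    return f"{parts.scheme}://{parts.netloc}/"


def detect_linksets(subject_object_pairs):
    """Count links per (source dataset, target dataset) pair.

    Expects pairs of plain URIs; statements with literal objects
    should be filtered out beforehand.
    """
    linksets = Counter()
    for subject, obj in subject_object_pairs:
        source, target = dataset_of(subject), dataset_of(obj)
        if source != target:
            linksets[(source, target)] += 1
    return linksets


# Hypothetical (subject, object) pairs for illustration.
pairs = [
    ("http://dbpedia.org/Berlin", "http://example.org/Berlin"),
    ("http://dbpedia.org/Paris", "http://example.org/Paris"),
    ("http://dbpedia.org/Berlin", "http://dbpedia.org/Germany"),
    ("http://example.org/Berlin", "http://dbpedia.org/Berlin"),
]
linksets = detect_linksets(pairs)
```

Each key of the resulting counter is one directed linkset, and its value is the number of links within it.
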
Detect similar subjects/contexts (subjects and contexts are mostly identical)
  • for two subject/context combinations
    find all quadruples
    where  and
    but and
  • count the number k of identical and pairs (number of aforementioned quadruple combinations) and derive k-similarity of subjects/contexts (where k is number of pairs: the higher k is the more similar subjects/contexts are)
  • detect subjects/contexts which are at least 1-similar but which are not directly referenced, for example by
  • for k > 0 are there any k-similar and
    where or ?
5
note that contexts are identical to subjects in most of the quadruples, but not all (cf. last subtask)