We ran iPopulator on the complete Wikipedia dump (as of December 2010) and extracted many new infobox attribute values. We provide the extracted data in three formats:
- Raw data: A list of tab-separated extraction records (article name, attribute, value) with raw values in MediaWiki syntax
- CSV: A list of comma-separated triples (article name as subject, attribute as predicate, extracted value as object)
- N3: The same triples (article name as subject, attribute as predicate, extracted value as object) in N3/Turtle syntax (a readable serialization format for RDF)
Note that while the raw data contains multi-valued entries (e.g., a list of names as the value of the attribute key_people in infobox_company), these values have been split up into several triples for CSV and N3. For these two formats, corrupted links have been removed, and all subjects, properties, and links in values have been transformed into resources or properties. In general, we use DBpedia resource and property URIs for our dataset. For clarity, however, all additionally extracted resources and properties that are not part of DBpedia use the namespace http://hpi-web.de/naumann/ipopulator.
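The relationship between the raw format and the triple formats can be illustrated with a minimal Python sketch. It is not the iPopulator implementation. The value separator (commas or `<br>` tags), the function names `split_raw_rows` and `to_n3`, and the sample row are assumptions made for illustration, and the URI construction simply replaces spaces with underscores in the DBpedia namespaces without any further encoding.

```python
import csv
import io
import re

# DBpedia namespaces; the naive URI building below is a simplification.
RES = "http://dbpedia.org/resource/"
PROP = "http://dbpedia.org/property/"

# Assumption: multi-values are separated by commas or <br> tags.
# This naive split would also break values that contain commas themselves.
SEPARATOR = re.compile(r"\s*(?:<br\s*/?>|,)\s*")
# Matches a simple MediaWiki link such as [[Target]] or [[Target|label]].
WIKI_LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

def split_raw_rows(tsv_text):
    """Yield one (article, attribute, value) triple per single value."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip malformed rows
        article, attribute, raw_value = row
        for value in SEPARATOR.split(raw_value):
            if value:
                yield (article, attribute, value)

def to_n3(article, attribute, value):
    """Render one triple as an N3/Turtle statement."""
    subject = f"<{RES}{article.replace(' ', '_')}>"
    predicate = f"<{PROP}{attribute}>"
    match = WIKI_LINK.fullmatch(value)
    if match:
        # Links in values become resources.
        obj = f"<{RES}{match.group(1).strip().replace(' ', '_')}>"
    else:
        # Everything else becomes an escaped string literal.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    return f"{subject} {predicate} {obj} ."

# Hypothetical raw row with a multi-valued attribute.
sample = "Example Corp\tkey_people\t[[Jane Doe]], [[John Roe]]\n"
triples = list(split_raw_rows(sample))
n3_lines = [to_n3(*t) for t in triples]
```

Here one raw row with two names yields two triples, which is why the triple counts reported for CSV and N3 exceed the number of extracted values.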
iPopulator automatically evaluates its extraction performance using existing infobox attribute values as test data. This allows us to extract new values only for promising infobox attributes. We provide extracted data with three different levels of minimum extraction precision (based on the test data).
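As a rough sketch of this selection step, the following Python code computes a per-attribute precision from test results and keeps only attributes that reach a minimum precision. The exact-match comparison, the function names, and the sample records are assumptions for illustration; the evaluation actually used by iPopulator may compare values differently.

```python
from collections import defaultdict

def attribute_precision(test_results):
    """Compute precision per attribute.

    test_results: iterable of (attribute, extracted_value, existing_value).
    Assumption: an extraction counts as correct only on exact match.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for attribute, extracted, existing in test_results:
        total[attribute] += 1
        if extracted == existing:
            correct[attribute] += 1
    return {a: correct[a] / total[a] for a in total}

def promising_attributes(precision, minimum):
    """Return the attributes whose precision reaches the minimum."""
    return sorted(a for a, p in precision.items() if p >= minimum)

# Hypothetical test data: existing infobox values serve as ground truth.
results = [
    ("key_people", "Jane Doe", "Jane Doe"),
    ("key_people", "John Roe", "John Roe"),
    ("founded", "1999", "1999"),
    ("founded", "2001", "2010"),
]
precision = attribute_precision(results)
selected = promising_attributes(precision, 0.80)
```

With this sample, `key_people` reaches a precision of 1.0 and is kept, while `founded` reaches only 0.5 and is dropped at the 80% level.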
Extraction precision | # extracted values | # triples generated from extracted values | Download
>= 80% | 259,892 | 307,700 | Raw, CSV, N3
>= 90% | 149,150 | 198,529 | Raw, CSV, N3
>= 95% | 109,345 | 158,115 | Raw, CSV, N3
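To relate the two count columns, the short calculation below derives the average number of triples per extracted value for each precision level. The numbers are taken from the table; the variable names are illustrative only.

```python
# (extracted values, generated triples) per minimum precision level.
rows = {
    ">= 80%": (259_892, 307_700),
    ">= 90%": (149_150, 198_529),
    ">= 95%": (109_345, 158_115),
}

# Average number of triples generated from one extracted value.
ratios = {
    level: round(triples / values, 2)
    for level, (values, triples) in rows.items()
}

for level, ratio in ratios.items():
    print(f"{level}: {ratio} triples per extracted value")
```

The ratio rises from about 1.18 at the 80% level to about 1.45 at the 95% level. One possible reading is that multi-valued attributes make up a larger share of the high-precision extractions, although the removal of corrupted links also affects these counts.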
The extracted data is provided for free use in any application. If you would like to use the data, we would be glad to hear about it. If you would like to cite our work, please refer to our CIKM paper [1].
Contact
If you have any questions or comments, please contact Dustin Lange.