Hasso-Plattner-Institut
  
Prof. Dr. Felix Naumann
  
 

iPopulator

Roughly every third Wikipedia article contains an infobox - a table that displays important facts about the subject in attribute-value form. The schema of an infobox, i.e., the attributes that can be expressed for a concept, is defined by an infobox template. Often, authors do not specify all template attributes, resulting in incomplete infoboxes.

With iPopulator, we introduce a system that automatically populates infoboxes of Wikipedia articles by extracting attribute values from the article's text. In contrast to prior work, iPopulator detects and exploits the structure of attribute values to independently extract value parts. We have tested iPopulator on the entire set of infobox templates and provide a detailed analysis of its effectiveness. For instance, we achieve an average extraction precision of 91% for 1,727 distinct infobox template attributes.

 

Extracted Data

We ran iPopulator on the complete Wikipedia dump (as of December 2010) and extracted many new infobox attribute values. We provide the extracted data in three formats:

  • Raw data: A list of tab-separated extraction records (article name, attribute, value) with raw values in MediaWiki syntax
  • CSV: A list of comma-separated triples (article name as subject, attribute as predicate, extracted value as object)
  • N3: The same triples (article name as subject, attribute as predicate, extracted value as object) in N3/Turtle syntax (a readable serialization format for RDF)
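As a minimal sketch of how the raw format could be consumed, the following Python snippet reads tab-separated records into (article, attribute, value) tuples. The sample rows and the function name `read_raw_triples` are hypothetical and only illustrate the layout described above; the sketch assumes one record per line and no tab characters inside values.

```python
import csv
import io

# Hypothetical sample in the raw layout: article name, attribute, value.
SAMPLE_RAW = (
    "Example Corp\tkey_people\t[[Jane Doe]], [[John Roe]]\n"
    "Example Town\tpopulation_total\t12,345\n"
)

def read_raw_triples(stream):
    """Yield (article, attribute, value) tuples from tab-separated raw data."""
    # QUOTE_NONE keeps quote characters in MediaWiki markup untouched.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip lines that do not match the assumed layout
        article, attribute, value = row
        yield article, attribute, value

triples = list(read_raw_triples(io.StringIO(SAMPLE_RAW)))
for article, attribute, value in triples:
    print(article, attribute, value, sep=" | ")
```

For a downloaded file, the same function could be called with a file object opened in text mode instead of the in-memory sample.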

Note that the raw data contains multi-values (e.g., a list of names as the value of the attribute key_people in infobox_company), whereas for CSV and N3 these values have been split up into several triples. For these two formats, corrupted links have been removed, and all subjects, properties, and links in values have been transformed into resources or properties. In general, we use DBpedia resource and property URIs for our dataset. For clarity, however, all additional extracted resources and properties that are not part of DBpedia use the namespace http://hpi-web.de/naumann/ipopulator.
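The splitting of a multi-value into several triples can be illustrated roughly as follows. This is only a sketch under assumptions, not the conversion code used for the dataset: it assumes that each wiki link in a raw value becomes one object resource, that titles map to DBpedia URIs by replacing spaces with underscores, and it does not check whether a resource is part of DBpedia (so the iPopulator namespace is not applied here). The function names and sample values are hypothetical.

```python
import re
from urllib.parse import quote

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"
DBPEDIA_PROPERTY = "http://dbpedia.org/property/"

# Matches [[Target]] and [[Target|label]]; captures only the link target.
WIKI_LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

def resource_uri(title):
    """Build a DBpedia-style resource URI from an article title (assumed mapping)."""
    return DBPEDIA_RESOURCE + quote(title.strip().replace(" ", "_"))

def to_triples(article, attribute, raw_value):
    """Return one N3/Turtle triple per wiki link, or one literal triple if no link is found."""
    subject = resource_uri(article)
    predicate = DBPEDIA_PROPERTY + attribute
    targets = WIKI_LINK.findall(raw_value)
    if targets:
        return [f"<{subject}> <{predicate}> <{resource_uri(t)}> ." for t in targets]
    # Escape backslashes and quotes so the literal stays valid.
    escaped = raw_value.replace("\\", "\\\\").replace('"', '\\"')
    return [f'<{subject}> <{predicate}> "{escaped}" .']

lines = to_triples("Example Corp", "key_people", "[[Jane Doe]], [[John Roe|J. Roe]]")
for line in lines:
    print(line)
```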

iPopulator automatically evaluates its extraction performance using existing infobox attribute values as test data. This allows us to extract new values only for promising infobox attributes. We provide extracted data with three different levels of minimum extraction precision (based on the test data).
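The idea of selecting promising attributes by a minimum precision can be sketched as follows. This is a simplified illustration, assuming that an extracted value counts as correct only if it equals the existing infobox value exactly; the actual comparison used by iPopulator may differ. The function names and the sample data are hypothetical.

```python
from collections import defaultdict

def precision_per_attribute(test_results):
    """Compute precision per attribute.

    test_results: iterable of (attribute, extracted_value, existing_value).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for attribute, extracted, existing in test_results:
        total[attribute] += 1
        if extracted == existing:  # assumed correctness criterion: exact match
            correct[attribute] += 1
    return {a: correct[a] / total[a] for a in total}

def promising_attributes(precisions, min_precision):
    """Return the attributes whose test precision reaches the threshold."""
    return {a for a, p in precisions.items() if p >= min_precision}

# Hypothetical test data: attribute x is right in 2 of 3 cases, y in 1 of 1.
sample = [("x", "1", "1"), ("x", "2", "2"), ("x", "3", "4"), ("y", "a", "a")]
precisions = precision_per_attribute(sample)
selected = promising_attributes(precisions, 0.9)
print(sorted(selected))
```

New values would then be extracted only for the attributes in `selected`, which corresponds to the minimum-precision levels offered in the downloads.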

Extraction precision | # extracted values | # triples generated from extracted values | Download
>= 80%               | 259,892            | 307,700                                   | Raw, CSV, N3
>= 90%               | 149,150            | 198,529                                   | Raw, CSV, N3
>= 95%               | 109,345            | 158,115                                   | Raw, CSV, N3

The extracted data is provided for free use in any application. If you would like to use the data, we would be glad to hear about it. If you would like to cite our work, please refer to our CIKM paper [1].

Contact

If you have any questions or comments, please contact Dustin Lange.

Publications

[1] Dustin Lange, Christoph Böhm, Felix Naumann. In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM), pages 1661-1664, Toronto, Canada, 2010.
[2] Dustin Lange, Christoph Böhm, Felix Naumann. Technical Report 38, Hasso-Plattner-Institut für Softwaresystemtechnik an der Universität Potsdam, 2010. ISBN 978-3-86956-081-6, ISSN 1613-5652.