Our group includes PostDocs, PhD students, and student assistants, and is headed by Prof. Felix Naumann. If you are interested in joining our team, please contact Felix Naumann.

For bachelor students, we offer German-language lectures on database systems in addition to paper- or project-oriented seminars. In a one-year bachelor project, students complete their studies in cooperation with external partners. For master students, we offer courses on information integration, data profiling, and information retrieval, complemented by specialized seminars and master projects, and we advise master theses.

Most of our research is conducted in the context of larger research projects, in collaboration across students, across groups, and across universities. We strive to make available most of our datasets and source code.

Please do not hesitate to reach out to us directly if you cannot find a paper, slides, or other research artifacts.

iPopulator

Roughly every third Wikipedia article contains an infobox - a table that displays important facts about the subject in attribute-value form. The schema of an infobox, i.e., the attributes that can be expressed for a concept, is defined by an infobox template. Often, authors do not specify all template attributes, resulting in incomplete infoboxes.

With iPopulator, we introduce a system that automatically populates infoboxes of Wikipedia articles by extracting attribute values from the article's text. In contrast to prior work, iPopulator detects and exploits the structure of attribute values to independently extract value parts. We have tested iPopulator on the entire set of infobox templates and provide a detailed analysis of its effectiveness. For instance, we achieve an average extraction precision of 91% for 1,727 distinct infobox template attributes.

Extracted Data

We ran iPopulator on the complete Wikipedia dump (as of December 2010). We could successfully extract many new infobox attribute values. In the following, we provide the extracted data in three formats:

Raw data: A list of tab-separated extraction records (article name, attribute, value) with raw values in MediaWiki syntax
CSV: A list of comma-separated triples (article name as subject, attribute as predicate, extracted value as object)
N3: Extracted triples (article name as subject, attribute as predicate, extracted value as object) in N3/Turtle syntax (a readable serialization format for RDF)
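As a rough illustration of how the raw format might be consumed, the following sketch reads tab-separated (article name, attribute, value) records. The `Extraction` type, the `read_raw` function, and the sample rows are hypothetical and only assume the three-column layout described above; the actual files may differ in details such as escaping.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple


class Extraction(NamedTuple):
    """One raw record: article name, infobox attribute, raw MediaWiki value."""
    article: str
    attribute: str
    value: str


def read_raw(lines: Iterable[str]) -> Iterator[Extraction]:
    """Yield one Extraction per well-formed tab-separated line."""
    # QUOTE_NONE keeps quote characters inside MediaWiki values untouched.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed lines
        yield Extraction(*row)


# Hypothetical sample rows; real rows would come from the raw download.
sample = io.StringIO(
    "Example Corp\tkey_people\t[[Jane Doe]], [[John Roe]]\n"
    "Example Corp\tfounded\t1999\n"
)
rows = list(read_raw(sample))
```

For a downloaded file, the same function could be applied to an open file handle instead of the in-memory sample.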

Note that while the raw data contains multi-values (e.g., a list of names as the value for the attribute key_people in infobox_company), these values have been split up into several triples for CSV and N3. For these two formats, corrupted links have been removed and all subjects, properties, and links in values have been transformed into resources or properties. In general, we use DBpedia resource and property URIs for our dataset. For clarity, however, all additional extracted resources and properties that are not part of DBpedia use the namespace http://hpi-web.de/naumann/ipopulator.
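The splitting of multi-values into several triples can be sketched as follows. This is not the iPopulator implementation: the wiki-link regular expression, the helper names, and the choice of `http://dbpedia.org/resource/` and `http://dbpedia.org/property/` as base URIs are assumptions made for illustration, and the sketch ignores the separate iPopulator namespace for non-DBpedia terms.

```python
import re
from urllib.parse import quote

# Assumed base URIs; the published dataset may use different prefixes.
DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"
DBPEDIA_PROPERTY = "http://dbpedia.org/property/"

# Matches [[Target]] and [[Target|label]], capturing only the target.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def resource_uri(title: str) -> str:
    """Build a resource URI from an article title (spaces become underscores)."""
    return DBPEDIA_RESOURCE + quote(title.strip().replace(" ", "_"))


def to_triples(article: str, attribute: str, raw_value: str) -> list[str]:
    """Turn one raw record into one N-Triples-style line per value part."""
    subject = f"<{resource_uri(article)}>"
    predicate = f"<{DBPEDIA_PROPERTY}{quote(attribute)}>"
    targets = WIKILINK.findall(raw_value)
    if targets:
        # Each linked article becomes its own resource-valued triple.
        objects = [f"<{resource_uri(t)}>" for t in targets]
    else:
        # No links: emit the whole value as a single string literal.
        escaped = raw_value.replace("\\", "\\\\").replace('"', '\\"')
        objects = [f'"{escaped}"']
    return [f"{subject} {predicate} {obj} ." for obj in objects]


triples = to_triples("Example Corp", "key_people",
                     "[[Jane Doe]], [[John Roe|J. Roe]]")
```

Here a single raw value with two links yields two triples that share the same subject and predicate, mirroring the split described above.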

iPopulator automatically evaluates its extraction performance using existing infobox attribute values as test data. This allows us to extract new values only for promising infobox attributes. We provide extracted data with three different levels of minimum extraction precision (based on the test data).
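The idea of selecting attributes by minimum extraction precision can be illustrated with a small sketch. It assumes that precision is the fraction of extracted values that match the existing infobox value and that matching is exact string equality; both are simplifications, and the function names and sample data are hypothetical rather than taken from iPopulator.

```python
from collections import defaultdict
from typing import Iterable, Optional


def precision_by_attribute(
    test_results: Iterable[tuple[str, Optional[str], str]]
) -> dict[str, float]:
    """Compute precision per attribute.

    Each item is (attribute, extracted_value, existing_value), where
    extracted_value is None if nothing was extracted for that test case.
    """
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for attribute, extracted, existing in test_results:
        if extracted is None:
            continue  # missing extractions affect recall, not precision
        total[attribute] += 1
        if extracted == existing:  # assumption: exact-match comparison
            correct[attribute] += 1
    return {a: correct[a] / total[a] for a in total}


def promising_attributes(precisions: dict[str, float],
                         min_precision: float) -> set[str]:
    """Keep only attributes whose test precision reaches the threshold."""
    return {a for a, p in precisions.items() if p >= min_precision}


# Hypothetical test data built from existing infobox values.
results = [
    ("founded", "1999", "1999"),
    ("founded", "2004", "2004"),
    ("key_people", "Jane Doe", "Jane Doe"),
    ("key_people", "John Roe", "Jon Roe"),
    ("key_people", None, "Ann Poe"),
]
precisions = precision_by_attribute(results)
selected = promising_attributes(precisions, min_precision=0.8)
```

Raising `min_precision` shrinks the set of attributes for which new values would be extracted, which corresponds to the three precision levels offered for download.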

Extraction precision	# extracted values	# triples generated from extracted values	Download
>= 80%	259,892	307,700	Raw	CSV	N3
>= 90%	149,150	198,529	Raw	CSV	N3
>= 95%	109,345	158,115	Raw	CSV	N3

The extracted data is provided for free use in any application. If you would like to use the data, we would be glad to hear about it. If you would like to cite our work, please refer to our CIKM paper [1].

Contact

If you have any questions or comments, please contact Dustin Lange.

Publications

[1] Extracting structured information from Wikipedia articles to populate infoboxes. Lange, Dustin; Böhm, Christoph; Naumann, Felix (2010). CIKM, pp. 1661–1664.

Extracting structured information from Wikipedia articles to populate infoboxes. Technical Report (38), Lange, Dustin; Böhm, Christoph; Naumann, Felix (2010).

Chair

Prof. Dr. Felix Naumann

Information Systems

E-Mail: felix.naumann(at)hpi.de

Assistant: Diana Stephan

Office: Campus II, House F, F-2.01
Tel.: +49 (0)331 5509-280
E-Mail: office-naumann(at)hpi.de

Project highlights

Metanome: Big Data Profiling

Metis: Data Quality Assessment

Janus: Change exploration

KITQAR: AI and Data Quality

News

17.11.2025 | New book chapter about "Data Quality for Enterprise AI" published

01.11.2025 | Paper accepted at WOP@ISWC

29.09.2025 | Paper accepted at NeurIPS 2025

29.09.2025 | Paper accepted at SIGMOD 2026

09.07.2025 | Paper accepted in SIGMOD Record
