Dustin Lange - iPopulator: Learning to Extract Structured Information from Wikipedia Articles to Populate Infoboxes
In this talk, we present iPopulator, a system that automatically populates infoboxes of Wikipedia articles by analyzing article texts. Wikipedia infoboxes provide a brief overview of the most important facts about an article's subject. Readers profit from well-populated infoboxes, since they can instantly gather interesting facts. Additionally, external applications that process infoboxes, such as DBpedia, benefit from complete infoboxes, as they create a large knowledge base from them.
The problem addressed by iPopulator is the population of existing infoboxes with as many correct attribute values as possible. We assume that an article already contains a potentially incomplete infobox. To approach the infobox population problem, we implemented a system that exploits attribute value structures and applies machine learning techniques. We have developed an algorithm that determines the structure shared by the majority of values of an attribute. Using this structure, an attribute value can be divided into parts, enabling the creation of a well-labeled training data set. Conditional Random Fields (CRFs) are applied to the training articles to learn extractors for all attributes in all infobox templates. These extractors are then applied to the test articles, and the extracted attribute value parts are merged into a result value according to the learned attribute value structure.
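The structure-based steps described above (finding the majority structure, dividing a value into parts, and merging parts into a result value) can be illustrated with a minimal Python sketch. It is not the algorithm used in iPopulator: the tokenizer, the labels "Number" and "Text", the sample values, and all function names are assumptions made for this illustration, and the CRF training step is omitted. Unlike a full solution, the sketch does not handle optional or repeated elements.

```python
import re
from collections import Counter

# Hypothetical tokenizer: digit runs, letter runs, whitespace runs,
# and single punctuation characters.
TOKEN_RE = re.compile(r"\d+|[^\W\d_]+|\s+|[^\w\s]|_")
PART_LABELS = ("Number", "Text")


def parse(value):
    """Split a value into (label, text) pairs."""
    pairs = []
    for token in TOKEN_RE.findall(value.strip()):
        if token.isdigit():
            pairs.append(("Number", token))
        elif token.isalpha():
            pairs.append(("Text", token))
        elif token.isspace():
            pairs.append((" ", " "))    # normalize whitespace to one space
        else:
            pairs.append((token, token))  # punctuation is its own label
    return pairs


def structure_of(value):
    """Return the label sequence of a value, e.g. Number ' ' '(' Number ')'."""
    return tuple(label for label, _ in parse(value))


def majority_structure(values):
    """Return the structure shared by the largest number of values."""
    counts = Counter(structure_of(v) for v in values)
    return counts.most_common(1)[0][0]


def split_into_parts(value, structure):
    """Return the Number/Text parts of a value, or None if it does not fit."""
    pairs = parse(value)
    if tuple(label for label, _ in pairs) != structure:
        return None
    return [text for label, text in pairs if label in PART_LABELS]


def merge_parts(parts, structure):
    """Rebuild a value from extracted parts according to the structure."""
    remaining = list(parts)
    pieces = []
    for label in structure:
        pieces.append(remaining.pop(0) if label in PART_LABELS else label)
    return "".join(pieces)


# Invented sample values for a single attribute.
values = ["12 (2008)", "7 (1999)", "about 30", "45 (2010)"]
structure = majority_structure(values)
print(structure)
print(split_into_parts("45 (2010)", structure))  # ['45', '2010']
print(merge_parts(["51", "2011"], structure))    # 51 (2011)
```

Values that fit the majority structure yield position-labeled parts that could serve as training labels for a sequence model, while values that do not fit (here "about 30") are simply skipped.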
To evaluate iPopulator, we use Wikipedia articles that already have an infobox. Our system achieves its best extraction results for attributes that frequently occur in article texts. In about two out of three cases, the attribute value structure can be reconstructed successfully. By extracting infobox attribute values, the knowledge that is summarized in infoboxes is extended. Additionally, exploiting attribute value structures when constructing attribute values results in better data quality.