We are happy to announce that our paper "Natural Key Discovery in Wikipedia Tables" has been accepted as a short paper at The Web Conference 2020.
Authors: Leon Bornemann, Tobias Bleifuß, Dmitri V. Kalashnikov, Felix Naumann and Deepak Srivastava
Abstract: Wikipedia is the largest encyclopedia to date. Scattered among its articles, there is an enormous number of tables that contain structured, relational information. In contrast to database tables, these webtables lack metadata, making it difficult to automatically interpret the knowledge they harbor. The natural key is a particularly important piece of metadata, which acts as a primary key and consists of attributes inherent to an entity. Determining natural keys is crucial for many tasks, such as information integration, table augmentation, or tracking changes to entities over time.
To address this challenge, we formally define the notion of natural keys and propose a supervised learning approach to automatically detect natural keys in Wikipedia tables using carefully engineered features. Our solution includes novel features that extract information from time (a table's version history) and space (other similar tables). On a curated dataset of 1,000 Wikipedia table histories, our model achieves 80% F-measure, which is at least 20% more than all related approaches. We use our model to discover natural keys in the entire corpus of Wikipedia tables and provide the dataset to the community to facilitate future research.
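To make the notion of a key candidate concrete, here is a minimal Python sketch of the uniqueness precondition that any primary key must satisfy. It is not the supervised model from the paper. The function name `minimal_unique_column_sets`, the `max_size` parameter, and the toy table are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def minimal_unique_column_sets(header, rows, max_size=2):
    """Return the minimal column combinations whose values are unique across rows.

    Illustrative only: uniqueness is a necessary condition for a key,
    not the feature-based natural key classifier described in the abstract.
    """
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(range(len(header)), size):
            # Skip supersets of combinations that are already unique.
            if any(set(smaller) <= set(combo) for smaller in found):
                continue
            projected = [tuple(row[i] for i in combo) for row in rows]
            if len(set(projected)) == len(projected):
                found.append(combo)
    return [tuple(header[i] for i in combo) for combo in found]


# Hypothetical toy table, not taken from the paper's dataset.
header = ["Season", "Team", "Points"]
rows = [
    ("2018", "A", 10),
    ("2018", "B", 12),
    ("2019", "A", 10),
    ("2019", "B", 9),
]

candidates = minimal_unique_column_sets(header, rows)
print(candidates)
```

On this toy table both (Season, Team) and (Season, Points) are unique, yet only the first plausibly consists of attributes inherent to the entity. Uniqueness in a single snapshot therefore cannot settle the question on its own, which is one way to see why additional signals such as a table's version history or similar tables can help.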