Bleifuss, Tobias, Leon Bornemann, Dmitri V. Kalashnikov, Felix Naumann, and Divesh Srivastava. The Secret Life of Wikipedia Tables. In Proceedings of the 2nd Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores (SEAData), co-located with VLDB, 2021.
Tables on the web, such as those on Wikipedia, are not the static grids of values that they seem to be. Rather, they have a life of their own: they are created under certain circumstances and in certain webpage locations, they change their shape, they move, they grow, they shrink, their data changes, they vanish, and they reappear. When users look at web tables or when scientists extract data from them, they are most likely not aware that behind each table lies a rich history. For this empirical paper, we have extracted, matched, and analyzed the entire history of all 3.5 M tables on the English Wikipedia, for a total of 53.8 M table versions. Based on this enormous dataset of public table histories, we provide various analysis results, such as statistics about lineage sizes, table positions, volatility, change intervals, schema changes, and the tables' editors. Apart from satisfying curiosity, analyzing and understanding the change behavior of web tables serves various use cases, such as identifying out-of-date values, recognizing systematic changes across tables, and discovering change dependencies.