[1]Bornemann, Leon, Tobias Bleifuß, Dmitri V. Kalashnikov, Fatemeh Nargesian, Felix Naumann, and Divesh Srivastava. Matching Roles from Temporal Data: Why Joe Biden is Not Only President, but Also Commander-in-Chief. Proceedings of the ACM on Management of Data (PACMMOD). 1(1):1–26, 2023. DOI:https://doi.org/10.1145/3588919.
We present role matching, a novel, fine-grained integrity constraint on temporal fact data, i.e., (subject, predicate, object, timestamp)-quadruples. A role is a combination of subject and predicate and can be associated with different objects as the real world evolves and the data changes over time. A role matching states that the associated object of two or more roles should always match across time. Once discovered, role matchings can serve as integrity constraints to improve data quality, for instance of structured data in Wikipedia [3]. If violated, role matchings can alert data owners or editors and thus allow them to correct the error. Finding all role matchings is challenging due both to the inherent quadratic complexity of the matching problem and to the need to identify true matches based on the possibly short history of the facts observed so far. To address the first challenge, we introduce several blocking methods both for clean and dirty input data. For the second challenge, the matching stage, we show how the entity resolution method Ditto [27] can be adapted to achieve satisfactory performance for the role matching task. We evaluate our method on datasets from Wikipedia infoboxes, showing that our blocking approaches can achieve 95% recall, while maintaining a reduction ratio of more than 99.99%, even in the presence of dirty data. In the matching stage, we achieve a macro F1-score of 89% on our datasets, using automatically generated labels.
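To make the data model concrete, here is a minimal, hypothetical sketch: temporal facts as (subject, predicate, object, timestamp) quadruples, a role as a (subject, predicate) pair, and a naive check of whether two roles' object histories agree at all shared timestamps. It is illustrative only, not the paper's blocking or Ditto-based matching pipeline, and all names and values are invented.

```python
# Sketch (assumed, not the paper's implementation): naive role-matching check.
from collections import defaultdict

Fact = tuple[str, str, str, int]  # (subject, predicate, object, timestamp)

def role_histories(facts: list[Fact]) -> dict[tuple[str, str], dict[int, str]]:
    """Group facts into roles: (subject, predicate) -> {timestamp: object}."""
    histories: dict[tuple[str, str], dict[int, str]] = defaultdict(dict)
    for s, p, o, t in facts:
        histories[(s, p)][t] = o
    return histories

def roles_match(h1: dict[int, str], h2: dict[int, str]) -> bool:
    """Two roles match if their objects agree at every timestamp observed in both."""
    shared = h1.keys() & h2.keys()
    return bool(shared) and all(h1[t] == h2[t] for t in shared)

facts = [
    ("USA", "president", "Joe Biden", 2021),
    ("USA", "commander-in-chief", "Joe Biden", 2021),
    ("USA", "president", "Donald Trump", 2017),
    ("USA", "commander-in-chief", "Donald Trump", 2017),
]
h = role_histories(facts)
print(roles_match(h[("USA", "president")], h[("USA", "commander-in-chief")]))  # True
```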
[2]Barth, Malte, Tibor Bleidt, Martin Büßemeyer, Fabian Heseding, Niklas Köhnecke, Tobias Bleifuß, Leon Bornemann, Dmitri V. Kalashnikov, Felix Naumann, and Divesh Srivastava. Detecting Stale Data in Wikipedia Infoboxes. In Proceedings of the International Conference on Extending Database Technology (EDBT), 2023.
Today’s fast-paced society is increasingly reliant on correct and up-to-date data. Wikipedia is the world’s most popular source of knowledge, and its infoboxes contain concise semi-structured data with important facts about a page’s topic. However, these data are not always up-to-date: we do not expect Wikipedia editors to update items at the moment their true values change. Also, many pages might not be well maintained and users might forget to update the data, e.g., when they are on holiday. To detect stale data in Wikipedia infoboxes, we combine correlation-based and rule-based approaches trained on different temporal granularities, based on all infobox changes over 15 years of English Wikipedia. We are able to predict 8.19 % of all changes with a precision of 89.69 % over a whole year, thus meeting our target precision of 85 % as suggested by the Wikimedia Foundation. These results can be used to mark potentially stale information on Wikipedia (on average 3,362 fields per week) for readers and to request an update by community contributors.
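As a rough illustration of the rule-based side of such an approach (not the paper's trained correlation- or rule-based models), the sketch below flags a field as potentially stale once the time since its last update clearly exceeds its typical historical update interval; the slack factor and the example history are assumptions.

```python
# Sketch (assumed, not the paper's method): simple staleness rule per infobox field.
from datetime import datetime, timedelta

def typical_interval(change_times: list[datetime]) -> timedelta | None:
    """Median gap between consecutive observed changes of one field."""
    if len(change_times) < 3:
        return None  # too little history to form a rule
    gaps = sorted(b - a for a, b in zip(change_times, change_times[1:]))
    return gaps[len(gaps) // 2]

def looks_stale(change_times: list[datetime], now: datetime, slack: float = 2.0) -> bool:
    """True if the field has gone `slack` times longer than usual without an update."""
    interval = typical_interval(sorted(change_times))
    if interval is None:
        return False
    return now - max(change_times) > slack * interval

# e.g. a field that used to change roughly yearly but has not been touched in 3 years
history = [datetime(2018, 6, 1), datetime(2019, 6, 15), datetime(2020, 7, 1)]
print(looks_stale(history, now=datetime(2023, 8, 1)))  # True
```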
[3]Lindner, Daniel, Franziska Schumann, Nicolas Alder, Tobias Bleifuß, Leon Bornemann, and Felix Naumann. Mining Change Rules. In Proceedings of the International Conference on Extending Database Technology (EDBT), pages 91–103, 2022.
Changes in data happen frequently, and discovering how the changes interrelate can reveal information about the data and the transactions on them. In this paper, we define change rules as recurring patterns in database changes. Change rules embody valuable metadata and reveal semantic as well as functional relationships between versions of data. We can use change rules to discover formerly unknown relationships, anticipate data changes and explore anomalies if changes do not occur as expected. We propose the CR-Miner algorithm, which dispenses with the manual formulation of rules to uncover this hidden knowledge in a generic and domain-independent way. Given a dataset together with its past versions, we efficiently discover change rules and rank them according to their potential for a manual review. The experimental results confirm that our method finds change rules efficiently in big data: On a subset of Wikipedia infoboxes encompassing data from four years and different categories, we discover 4,456 change rules. Rules between changes from 48,706 tables of open-government data observed over the period of one year can be discovered within 33 minutes, and rules between about 2.5 million Wikipedia infoboxes from 153 templates within 77 minutes.
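To illustrate what a change rule expresses ("when X changes, Y tends to change too"), here is a toy confidence computation over invented change events. It is not the CR-Miner algorithm; the event format, field names, and window are assumptions for the example.

```python
# Sketch (assumed, not CR-Miner): confidence of a candidate change rule X -> Y.
from collections import defaultdict

# (timestamp, entity, field) change events, invented for illustration
changes = [
    (1, "page_A", "population"), (1, "page_A", "population_as_of"),
    (5, "page_A", "population"), (5, "page_A", "population_as_of"),
    (9, "page_A", "population"),
]

def rule_confidence(changes, antecedent: str, consequent: str, window: int = 0) -> float:
    """Fraction of antecedent changes followed (within `window` steps) by a consequent change."""
    by_field = defaultdict(list)
    for t, entity, field in changes:
        by_field[(entity, field)].append(t)
    hits = total = 0
    for (entity, field), times in by_field.items():
        if field != antecedent:
            continue
        consequent_times = by_field.get((entity, consequent), [])
        for t in times:
            total += 1
            if any(t <= u <= t + window for u in consequent_times):
                hits += 1
    return hits / total if total else 0.0

print(rule_confidence(changes, "population", "population_as_of"))  # ~0.67: a candidate change rule
```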
[4]Bleifuß, Tobias, Leon Bornemann, Dmitri V. Kalashnikov, Felix Naumann, and Divesh Srivastava. The Secret Life of Wikipedia Tables. In Proceedings of the 2nd Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores (SEAData), co-located with VLDB, 2021.
Tables on the web, such as those on Wikipedia, are not the static grid of values that they seem to be. Rather, they have a life of their own: they are created under certain circumstances and in certain webpage locations, they change their shape, they move, they grow, they shrink, their data changes, they vanish, and they re-appear. When users look at web tables or when scientists extract data from them, they are most likely not aware that behind each table lies a rich history. For this empirical paper, we have extracted, matched and analyzed the entire history of all 3.5 M tables on the English Wikipedia for a total of 53.8 M table versions. Based on this enormous dataset of public table histories, we provide various analysis results, such as statistics about lineage sizes, table positions, volatility, change intervals, schema changes, and their editors. Apart from satisfying curiosity, analyzing and understanding the change-behavior of web tables serves various use cases, such as identifying out-of-date values, recognizing systematic changes across tables, and discovering change dependencies.
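As a small illustration of two of the statistics mentioned above, the sketch below computes change intervals and lineage sizes from the revision timestamps of matched tables; the data is invented.

```python
# Sketch (invented data): change intervals and lineage sizes per table lineage.
from datetime import date

lineages = {
    "table_1": [date(2019, 1, 3), date(2019, 6, 20), date(2020, 6, 30)],
    "table_2": [date(2021, 2, 1), date(2021, 2, 2), date(2021, 2, 9)],
}

for table, timestamps in lineages.items():
    gaps = [(b - a).days for a, b in zip(timestamps, timestamps[1:])]
    print(table, "change intervals (days):", gaps, "| lineage size:", len(timestamps))
# table_1 changes rarely; table_2 saw a burst of edits within one week
```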
[5]Bleifuß, Tobias, Leon Bornemann, Dmitri V. Kalashnikov, Felix Naumann, and Divesh Srivastava. Structured Object Matching across Web Page Revisions. In IEEE International Conference on Data Engineering (ICDE), pages 1284–1295, 2021.
A considerable amount of useful information on the web is (semi-)structured, such as tables and lists. An extensive corpus of prior work addresses the problem of making these human-readable representations interpretable by algorithms. Most of these works focus only on the most recent snapshot of these web objects. However, their evolution over time represents valuable information that has barely been tapped, enabling various applications, including visual change exploration and trust assessment. To realize the full potential of this information, it is critical to match such objects across page revisions. In this work, we present novel techniques that match tables, infoboxes and lists within a page across page revisions. We are, thus, able to extract the evolution of structured information in various forms from a long series of web page revisions. We evaluate our approach on a representative sample of pages and measure the number of correct matches. Our approach achieves a significant improvement in object matching over baselines and over related work.
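The following is a simplified sketch of the general idea of matching structured objects between two page revisions by cell-content overlap, using greedy one-to-one assignment by Jaccard similarity. The paper's actual matching techniques are more involved; the threshold and data are invented.

```python
# Sketch (assumed, simplified): greedy object matching across two revisions.
def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def match_objects(old: dict[str, set[str]], new: dict[str, set[str]], threshold: float = 0.5):
    """Greedily pair old and new objects (id -> set of cell values) by descending similarity."""
    candidates = sorted(
        ((jaccard(co, cn), o, n) for o, co in old.items() for n, cn in new.items()),
        reverse=True,
    )
    matched_old, matched_new, pairs = set(), set(), []
    for score, o, n in candidates:
        if score < threshold:
            break
        if o not in matched_old and n not in matched_new:
            pairs.append((o, n, score))
            matched_old.add(o)
            matched_new.add(n)
    return pairs

old_rev = {"t1": {"Berlin", "3.6M"}, "t2": {"gold", "silver", "bronze"}}
new_rev = {"a": {"gold", "silver", "bronze", "total"}, "b": {"Berlin", "3.7M"}}
print(match_objects(old_rev, new_rev))  # [('t2', 'a', 0.75)]; t1/b falls below the threshold
```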
[6]Alder, Nicolas, Tobias Bleifuß, Leon Bornemann, Felix Naumann, and Tim Repke. Ein Data Engineering Kurs für 10.000 Teilnehmer. Datenbank-Spektrum. 20(1):5–9, 2021. DOI:https://doi.org/10.1007/s13222-020-00354-8.
In January and February 2020, we offered a Massive Open Online Course (MOOC) on the openHPI platform with the goal of introducing non-experts to the concepts, ideas, and challenges of data science. In more than one hundred short course units spread over six weeks, we explained just as many buzzwords. We report on the structure of the course, our goals, the interaction with the participants, and the outcomes of the course.
[7]Bornemann, Leon, Tobias Bleifuß, Dmitri V. Kalashnikov, Felix Naumann, and Divesh Srivastava. Natural Key Discovery in Wikipedia Tables. In Proceedings of The World Wide Web Conference (WWW), pages 2789–2795, 2020.
Wikipedia is the largest encyclopedia to date. Scattered among its articles, there is an enormous number of tables that contain structured, relational information. In contrast to database tables, these webtables lack metadata, making it difficult to automatically interpret the knowledge they harbor. The natural key is a particularly important piece of metadata, which acts as a primary key and consists of attributes inherent to an entity. Determining natural keys is crucial for many tasks, such as information integration, table augmentation, or tracking changes to entities over time. To address this challenge, we formally define the notion of natural keys and propose a supervised learning approach to automatically detect natural keys in Wikipedia tables using carefully engineered features. Our solution includes novel features that extract information from time (a table’s version history) and space (other similar tables). On a curated dataset of 1,000 Wikipedia table histories, our model achieves 80% F-measure, which is at least 20% more than all related approaches. We use our model to discover natural keys in the entire corpus of Wikipedia tables and provide the dataset to the community to facilitate future research.
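As a toy example of the kind of per-column signals such a model might consume, the sketch below computes uniqueness, fill-ratio, and position features for a small table. The paper's supervised model uses considerably richer features drawn from a table's version history and from similar tables; everything here is an assumption for illustration.

```python
# Sketch (assumed features, invented data): simple per-column natural-key signals.
def column_features(rows: list[list[str]], header: list[str]) -> list[dict]:
    features = []
    for i, name in enumerate(header):
        values = [row[i] for row in rows]
        non_empty = [v for v in values if v.strip()]
        features.append({
            "column": name,
            "position": i,  # natural keys tend to appear in the left-most columns
            "uniqueness": len(set(non_empty)) / len(values) if values else 0.0,
            "fill_ratio": len(non_empty) / len(values) if values else 0.0,
        })
    return features

header = ["Athlete", "Country", "Medals"]
rows = [["A. Smith", "USA", "3"], ["B. Jones", "GBR", "3"], ["C. Lee", "USA", "1"]]
for f in column_features(rows, header):
    print(f)  # 'Athlete' scores highest on uniqueness, a typical natural-key signal
```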
[8]Bleifuß, Tobias, Leon Bornemann, Dmitri V. Kalashnikov, Felix Naumann, and Divesh Srivastava. DBChEx: Interactive Exploration of Data and Schema Change. In Proceedings of the Conference on Innovative Data Systems Research (CIDR), 2019.
Data exploration is a visually-driven process that is often used as a first step to decide which aspects of a dataset are worth further investigation and analysis. It serves as an important tool to gain a first understanding of a dataset and to generate hypotheses. While there are many tools for exploring static datasets, dynamic datasets that change over time still lack effective exploration support. To address this shortcoming, we present our innovative tool Database Change Explorer (DBChEx) that enables exploration of data and schema change through a set of exploration primitives. Users gain valuable insights into data generation processes and data or schema evolution over time by a mix of serendipity and guided investigation. The tool is a server-client application with a web front-end and an underlying database that stores the history of changes in the data and schema in a data model called the change-cube. Our demonstration of DBChEx shows how users can interactively explore data and schema change in two real-world datasets, IMDB and Wikipedia infoboxes.
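A minimal sketch of the change-cube idea and one simple exploration primitive is shown below; the record layout and function names are illustrative assumptions, not DBChEx's actual schema or API.

```python
# Sketch (assumed layout): change-cube records and a drill-down primitive.
from collections import Counter

# change records as (timestamp, id, property, value); values are illustrative
change_cube = [
    ("2019-01-02", "Berlin", "population", "3644826"),
    ("2019-05-11", "Berlin", "mayor", "Michael Müller"),
    ("2021-12-21", "Berlin", "mayor", "Franziska Giffey"),
    ("2023-04-27", "Berlin", "mayor", "Kai Wegner"),
]

def drill_down(cube, entity: str, prop: str):
    """History of one property of one entity, ordered by time."""
    return sorted((t, v) for t, e, p, v in cube if e == entity and p == prop)

def changes_per_year(cube) -> Counter:
    """Aggregate: how many changes were recorded per year."""
    return Counter(t[:4] for t, *_ in cube)

print(drill_down(change_cube, "Berlin", "mayor"))
print(changes_per_year(change_cube))  # Counter({'2019': 2, '2021': 1, '2023': 1})
```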
[9]Bleifuß, Tobias, Leon Bornemann, Theodore Johnson, Dmitri V. Kalashnikov, Felix Naumann, and Divesh Srivastava. Exploring Change - A New Dimension of Data Analytics. Proceedings of the VLDB Endowment (PVLDB). 12(2):85–98, 2018.
Data and metadata in datasets experience many different kinds of change. Values are inserted, deleted or updated; rows appear and disappear; columns are added or repurposed, etc. In such a dynamic situation, users might have many questions related to changes in the dataset, for instance which parts of the data are trustworthy and which are not? Users will wonder: How many changes have there been in the recent minutes, days or years? What kind of changes were made at which points of time? How dirty is the data? Is data cleansing required? The fact that data changed can hint at different hidden processes or agendas: a frequently crowd-updated city name may be controversial; a person whose name has been recently changed may be the target of vandalism; and so on. We show various use cases that benefit from recognizing and exploring such change. We envision a system and methods to interactively explore such change, addressing the variability dimension of big data challenges. To this end, we propose a model to capture change and the process of exploring dynamic data to identify salient changes. We provide exploration primitives along with motivational examples and measures for the volatility of data. We identify technical challenges that need to be addressed to make our vision a reality, and propose directions of future work for the data management community.
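As a toy reading of one proposed ingredient, a possible volatility measure could simply normalize an entity's change count by the length of the observation window; the formula and data below are assumptions for illustration, not the paper's definition.

```python
# Sketch (assumed measure, invented data): changes per 30 days as a volatility proxy.
events = [
    (10, "page_A", "population"), (40, "page_A", "mayor"), (200, "page_A", "population"),
    (5, "page_B", "name"),
]  # (timestamp_in_days, entity, property) change events over a 365-day window

def volatility(events, entity: str, window_days: int = 365) -> float:
    """Average number of recorded changes per 30 days for one entity."""
    n = sum(1 for _, e, _ in events if e == entity)
    return n / (window_days / 30)

print(round(volatility(events, "page_A"), 2))  # ~0.25 changes per month
print(round(volatility(events, "page_B"), 2))  # ~0.08: page_B is far less volatile
```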
[10]Bornemann, Leon, Tobias Bleifuß, Dmitri V. Kalashnikov, Felix Naumann, and Divesh Srivastava. Data Change Exploration using Time Series Clustering. Datenbank-Spektrum. 18(2):1–9, 2018. DOI:https://doi.org/10.1007/s13222-018-0285-x.
Analysis of static data is one of the best studied research areas. However, data changes over time. These changes may reveal patterns or groups of similar values, properties, and entities. We study changes in large, publicly available data repositories by modelling them as time series and clustering these series by their similarity. In order to perform change exploration on real-world data we use the publicly available revision data of Wikipedia Infoboxes and weekly snapshots of IMDB. The changes to the data are captured as events, which we call change records. In order to extract temporal behavior we count changes in time periods and propose a general transformation framework that aggregates groups of changes to numerical time series of different resolutions. We use these time series to study different application scenarios of unsupervised clustering. Our explorative results show that changes made to collaboratively edited data sources can help find characteristic behavior, distinguish entities or properties and provide insight into the respective domains.
Editors: Sebastian Michel, Rainer Gemulla, and Ralf Schenkel
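To illustrate the transformation described in the abstract above, the sketch below turns change records into monthly change-count time series and clusters them with k-means; the temporal resolution, the clustering setup, and the data are assumptions for the example, not the paper's exact configuration.

```python
# Sketch (assumed setup): change records -> monthly change counts -> clustering.
import numpy as np
from sklearn.cluster import KMeans

# change records as (property, month_index) pairs: which infobox property changed when
records = (
    [("population", m) for m in (0, 12, 24, 36)]
    + [("area_km2", m) for m in (0, 12, 24)]
    + [("mayor", m) for m in (5, 29)]
)

n_months = 48
properties = sorted({p for p, _ in records})
series = np.zeros((len(properties), n_months))
for p, m in records:
    series[properties.index(p), m] += 1  # aggregate: change counts per month

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(series)
for prop, label in zip(properties, labels):
    print(prop, "-> cluster", label)  # the two yearly-updated properties share a cluster
```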