Bleifuß, Tobias, Leon Bornemann, Dmitri V. Kalashnikov, Felix Naumann, and Divesh Srivastava. Structured Object Matching across Web Page Revisions. In IEEE International Conference on Data Engineering (ICDE), 2021.
Alder, Nicolas, Tobias Bleifuß, Leon Bornemann, Felix Naumann, and Tim Repke. Ein Data Engineering Kurs für 10.000 Teilnehmer. Datenbank-Spektrum. 20(1), 2021. DOI:https://doi.org/10.1007/s13222-020-00354-8.
In January and February 2020, we offered a Massive Open Online Course (MOOC) on the openHPI platform with the aim of introducing non-experts to the concepts, ideas, and challenges of data science. In more than one hundred short course units spread over six weeks, we explained just as many buzzwords. We report on the structure of the course, our goals, the interaction with the participants, and the results of the course.
Bornemann, Leon, Tobias Bleifuß, Dmitri V. Kalashnikov, Felix Naumann, and Divesh Srivastava. Natural Key Discovery in Wikipedia Tables. In Proceedings of The World Wide Web Conference (WWW), pages 2789–2795, 2020.
Wikipedia is the largest encyclopedia to date. Scattered among its articles, there is an enormous number of tables that contain structured, relational information. In contrast to database tables, these webtables lack metadata, making it difficult to automatically interpret the knowledge they harbor. The natural key is a particularly important piece of metadata, which acts as a primary key and consists of attributes inherent to an entity. Determining natural keys is crucial for many tasks, such as information integration, table augmentation, or tracking changes to entities over time. To address this challenge, we formally define the notion of natural keys and propose a supervised learning approach to automatically detect natural keys in Wikipedia tables using carefully engineered features. Our solution includes novel features that extract information from time (a table’s version history) and space (other similar tables). On a curated dataset of 1,000 Wikipedia table histories, our model achieves 80% F-measure, which is at least 20% more than all related approaches. We use our model to discover natural keys in the entire corpus of Wikipedia tables and provide the dataset to the community to facilitate future research.
Dürsch, Falco, Axel Stebner, Fabian Windheuser, Maxi Fischer, Tim Friedrich, Nils Strelow, Tobias Bleifuß, Hazar Harmouch, Lan Jiang, Thorsten Papenbrock, and Felix Naumann. Inclusion Dependency Discovery: An Experimental Evaluation of Thirteen Algorithms. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 219–228, 2019.
Bleifuß, Tobias, Leon Bornemann, Dmitri V. Kalashnikov, Felix Naumann, and Divesh Srivastava. DBChEx: Interactive Exploration of Data and Schema Change. In Proceedings of the Conference on Innovative Data Systems Research (CIDR), 2019.
Bleifuß, Tobias, Leon Bornemann, Theodore Johnson, Dmitri V. Kalashnikov, Felix Naumann, and Divesh Srivastava. Exploring Change - A New Dimension of Data Analytics. Proceedings of the VLDB Endowment (PVLDB). 12(2):85-98, 2018.
Data and metadata in datasets experience many different kinds of change. Values are inserted, deleted or updated; rows appear and disappear; columns are added or repurposed, etc. In such a dynamic situation, users might have many questions related to changes in the dataset, for instance which parts of the data are trustworthy and which are not? Users will wonder: How many changes have there been in the recent minutes, days or years? What kind of changes were made at which points of time? How dirty is the data? Is data cleansing required? The fact that data changed can hint at different hidden processes or agendas: a frequently crowd-updated city name may be controversial; a person whose name has been recently changed may be the target of vandalism; and so on. We show various use cases that benefit from recognizing and exploring such change. We envision a system and methods to interactively explore such change, addressing the variability dimension of big data challenges. To this end, we propose a model to capture change and the process of exploring dynamic data to identify salient changes. We provide exploration primitives along with motivational examples and measures for the volatility of data. We identify technical challenges that need to be addressed to make our vision a reality, and propose directions of future work for the data management community.
Bornemann, Leon, Tobias Bleifuß, Dmitri Kalashnikov, Felix Naumann, and Divesh Srivastava. Data Change Exploration using Time Series Clustering. Datenbank-Spektrum. 18(2):1-9, 2018. DOI:https://doi.org/10.1007/s13222-018-0285-x.
Analysis of static data is one of the best studied research areas. However, data changes over time. These changes may reveal patterns or groups of similar values, properties, and entities. We study changes in large, publicly available data repositories by modelling them as time series and clustering these series by their similarity. In order to perform change exploration on real-world data we use the publicly available revision data of Wikipedia Infoboxes and weekly snapshots of IMDB. The changes to the data are captured as events, which we call change records. In order to extract temporal behavior we count changes in time periods and propose a general transformation framework that aggregates groups of changes to numerical time series of different resolutions. We use these time series to study different application scenarios of unsupervised clustering. Our explorative results show that changes made to collaboratively edited data sources can help find characteristic behavior, distinguish entities or properties and provide insight into the respective domains.
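The aggregation step described above can be sketched as follows. This is a minimal illustration, not the paper's transformation framework: the record layout `(entity, property, day)`, the function name `to_time_series`, and the sample data are all assumptions made for the example.

```python
from datetime import date

# Hypothetical change records: (entity, property, day of change).
change_records = [
    ("Berlin", "population", date(2018, 1, 3)),
    ("Berlin", "population", date(2018, 1, 5)),
    ("Berlin", "mayor", date(2018, 1, 17)),
    ("Paris", "population", date(2018, 1, 4)),
]

def to_time_series(records, start, num_buckets, bucket_days=7):
    """Count changes per entity in consecutive buckets of bucket_days days."""
    series = {}
    for entity, _prop, day in records:
        index = (day - start).days // bucket_days
        if 0 <= index < num_buckets:  # ignore changes outside the window
            series.setdefault(entity, [0] * num_buckets)[index] += 1
    return series

# Weekly resolution over three weeks starting on 1 January 2018.
weekly = to_time_series(change_records, date(2018, 1, 1), num_buckets=3)
```

Changing `bucket_days` yields series of a different resolution; the resulting vectors could then be handed to any clustering method of choice.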
Bleifuß, Tobias, Sebastian Kruse, and Felix Naumann. Efficient Denial Constraint Discovery with Hydra. Proceedings of the VLDB Endowment (PVLDB). 11(3):311-323, 2017.
Denial constraints (DCs) are a generalization of many other integrity constraints (ICs) widely used in databases, such as key constraints, functional dependencies, or order dependencies. Therefore, they can serve as a unified reasoning framework for all of these ICs and express business rules that cannot be expressed by the more restrictive IC types. The process of formulating DCs by hand is difficult, because it requires not only domain expertise but also database knowledge, and due to DCs' inherent complexity, this process is tedious and error-prone. Hence, an automatic DC discovery is highly desirable: we search for all valid denial constraints in a given database instance. However, due to the large search space, the problem of DC discovery is computationally expensive. We propose a new algorithm Hydra, which overcomes the quadratic runtime complexity in the number of tuples of state-of-the-art DC discovery methods. The new algorithm's experimentally determined runtime grows only linearly in the number of tuples. This results in a speedup by orders of magnitude, especially for datasets with a large number of tuples. Hydra can deliver results in a matter of seconds that to date took hours to compute.
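To make the notion of a denial constraint concrete, the sketch below validates a single, given DC by comparing all ordered tuple pairs. It is not Hydra and performs no discovery; it only illustrates why pairwise comparison is quadratic in the number of tuples. The sample rows, the example constraint, and the function name `violations` are assumptions made for the example.

```python
from itertools import permutations

# Hypothetical relation.
rows = [
    {"name": "a", "salary": 3000, "tax": 300},
    {"name": "b", "salary": 4000, "tax": 450},
    {"name": "c", "salary": 5000, "tax": 400},
]

def violations(rows, predicates):
    """Return all ordered tuple pairs (t, u) for which every predicate holds.

    A DC forbids any pair satisfying all of its predicates, so the DC is
    valid on the instance exactly when this list is empty.
    """
    return [
        (t, u)
        for t, u in permutations(rows, 2)  # n * (n - 1) comparisons
        if all(p(t, u) for p in predicates)
    ]

# Example DC: no tuple may have a higher salary but a lower tax than another.
dc = [
    lambda t, u: t["salary"] > u["salary"],
    lambda t, u: t["tax"] < u["tax"],
]

bad = violations(rows, dc)
```

Here the pair (c, b) violates the constraint, so this DC does not hold on the sample instance.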
Bleifuß, Tobias, Theodore Johnson, Dmitri V. Kalashnikov, Felix Naumann, Vladislav Shkapenyuk, and Divesh Srivastava. Enabling Change Exploration (Vision). In Proceedings of the Fourth International Workshop on Exploratory Search in Databases and the Web (ExploreDB), pages 1-3, 2017.
Data and metadata suffer many different kinds of change: values are inserted, deleted or updated, entities appear and disappear, properties are added or re-purposed, etc. Explicitly recognizing, exploring, and evaluating such change can alert to changes in data ingestion procedures, can help assess data quality, and can improve the general understanding of the dataset and its behavior over time. We propose a data model-independent framework to formalize such change. Our change-cube enables exploration and discovery of such changes to reveal dataset behavior over time.
Bleifuß, Tobias, Susanne Bülow, Johannes Frohnhofen, Julian Risch, Georg Wiese, Sebastian Kruse, Thorsten Papenbrock, and Felix Naumann. Approximate Discovery of Functional Dependencies for Large Datasets. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 1803-1812, 2016.
Functional dependencies (FDs) are an important prerequisite for various data management tasks, such as schema normalization, query optimization, and data cleansing. However, automatic FD discovery entails an exponentially growing search and solution space, so that even today’s fastest FD discovery algorithms are limited to small datasets only, due to long runtimes and high memory consumptions. To overcome this situation, we propose an approximate discovery strategy that sacrifices possibly little result correctness in return for large performance improvements. In particular, we introduce AID-FD, an algorithm that approximately discovers FDs within runtimes up to orders of magnitude faster than state-of-the-art FD discovery algorithms. We evaluate and compare our performance results with a focus on scalability in runtime and memory, and with measures for completeness, correctness, and minimality.
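As a reminder of what an FD asserts, the sketch below checks a single candidate dependency on a small relation. It is not AID-FD and does not search the space of candidates; the function name `fd_holds` and the sample rows are assumptions made for the example.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows.

    The FD holds when any two rows that agree on all lhs attributes
    also agree on all rhs attributes.
    """
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # Remember the first rhs value per lhs value; any mismatch refutes the FD.
        if seen.setdefault(key, value) != value:
            return False
    return True

# Hypothetical relation.
rows = [
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "14467", "city": "Potsdam"},
    {"zip": "10115", "city": "Berlin"},
]

zip_determines_city = fd_holds(rows, ["zip"], ["city"])
city_determines_zip = fd_holds(rows, ["city"], ["zip"])
```

Checking one candidate is linear in the number of rows; the cost of discovery comes from the exponentially many attribute combinations that may need checking.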