Change Exploration
The Janus (IANVS) Project
Data changes all the time. In this project, we want to explore and understand those changes. We call this activity change exploration: for a given dynamic dataset, we want to efficiently capture and summarize changes at both the instance and the schema level, enable users to explore these changes interactively and graphically, and analyze patterns in the changing data.
The art of exploration is to preserve order amid change and to preserve change amid order. (adapted from Alfred North Whitehead)
Change-cube
We choose a generic model to represent changes to a dataset. It includes the following four dimensions to represent what changed where, when, and how:
- Time
- Entity (ID)
- Property
- Value
A change c is a quadruple of the form
<Time, ID, Property, Value> or in brief <t, id, p, v>.
Its semantics is: at time t, the property p of the entity identified by id was created with, or changed to, the value v. A change-cube is a set of such changes. For more details on our data model, see our vision paper at VLDB 2019 (see below).
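The quadruple model can be illustrated with a minimal Python sketch. This is not the project's implementation: the names `Change` and `snapshot`, the field names, and the sample entity "film-1" are hypothetical and chosen only for illustration. The sketch stores a change-cube as a set of changes and derives the state of the dataset at a given point in time by keeping the latest value per (entity, property) pair.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, Iterable, Tuple


@dataclass(frozen=True)
class Change:
    """One change <t, id, p, v> (hypothetical field names)."""
    time: datetime
    entity_id: str
    prop: str
    value: Any


def snapshot(cube: Iterable[Change], at: datetime) -> Dict[Tuple[str, str], Any]:
    """Return the latest value of every (entity, property) pair as of `at`.

    Assumption: later changes overwrite earlier ones; the order of changes
    with identical timestamps is not defined in this sketch.
    """
    state: Dict[Tuple[str, str], Any] = {}
    for change in sorted(cube, key=lambda c: c.time):
        if change.time <= at:
            state[(change.entity_id, change.prop)] = change.value
    return state


# A change-cube is a set of changes; the sample data below is made up.
cube = {
    Change(datetime(2019, 1, 1), "film-1", "title", "Draft Title"),
    Change(datetime(2019, 2, 1), "film-1", "year", 2019),
    Change(datetime(2019, 3, 1), "film-1", "title", "Final Title"),
}

mid_february = snapshot(cube, datetime(2019, 2, 15))
end_of_year = snapshot(cube, datetime(2019, 12, 31))
```

Here `mid_february` still holds the first title, while `end_of_year` reflects the later title change. Keeping the changes in a flat set, rather than in per-entity histories, mirrors the generic four-dimensional model: the same data can be grouped or filtered along any of the dimensions time, entity, property, or value.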
Sources
- Code Repositories:
  - Change Clustering Framework: Framework to cluster changes represented in a change-cube
  - IMDB Parser: Parser and Scraper for the data in the IMDB semi-structured text format (pre 2018)
  - Natural Key Discovery in Wikipedia Tables: Supervised Learning Approach for the discovery of natural keys (entity identifiers) in relational Wikipedia tables.
  - Matching Roles from Temporal Data: The complete CBRM framework to discover role matchings in temporal fact data is linked on the project page.
  - Schema Change Recommendations: An algorithm to recommend schema changes for Wikipedia tables.
- Datasets:
  - Matching Roles from Temporal Data: All datasets relevant for this project can be found on the project page.
  - Structured Object Matching Across Web Page Revisions:
  - Natural Keys in Wikipedia Table Histories: 1000 Wikipedia Table Histories with annotated natural keys
  - All histories of relational Wikipedia tables with programmatically annotated natural keys: Due to its large size, we provide this dataset only upon request.
  - Extracted schema changes: See project page.
  - Other relevant datasets
- Tools:
Team
- Project lead: Prof. Felix Naumann
- Doctoral researchers: Tobias Bleifuß and Leon Bornemann
- In collaboration with: Dmitri V. Kalashnikov and Divesh Srivastava – AT&T Labs - Research
Former members
- Student assistants: Joana Bergsiek, Kshitij Kumar, Hung Nguyen
- Collaborator: Theodore Johnson – AT&T Labs - Research
Student projects
- Master project: Vandalism Detection in Wikipedia Table Revisions
- Bachelor project: Unit Testing Data for Machine Learning (with Amazon Research Berlin)
- Master project: Discovering Change Dependencies