Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Master's Project: WikiWatch

General Information

Data Quality in Knowledge Graphs

Knowledge graphs (KGs) represent real-world objects and their relationships as nodes and edges. Wikidata is a popular and enormous general-purpose KG that is created and curated by both humans and machines [1]. As of 2023, its data dump holds 766 GB of data, with 103 million entities, 2.8 million classes, and 11,000 properties [7]. Wikidata's size and the diversity of its editors introduce different data quality (DQ) challenges [2], such as incomplete entity descriptions or classes that are difficult to distinguish but still appear separately, e.g., “geographical location” vs. “location” vs. “physical location”.

The Wikidata community has already developed different policies and approaches to uphold the quality of the data, such as property constraints, entity schemas, and showcase items [3]. DQ is typically measured along so-called DQ dimensions, such as accuracy, completeness, timeliness, or understandability. Several tools have been developed to assess specific DQ dimensions in Wikidata [4, 5]. They help maintainers identify certain issues for later correction, or support editors when creating new content [6, 7]. However, no tool exists to assess and visualize how the quality of Wikidata evolves over time, which would help the community measure the impact of its approaches and identify future needs.

Project Goals

In this project, we will develop a Python prototype to assess the quality of Wikidata over time. To achieve this, we want to (1) extract the history of changes for all entities in Wikidata, (2) efficiently calculate data quality measures for different dimensions, and (3) ultimately visualize how these measures evolve over time. To reach these goals, we have planned the following tasks:

  1. Extract the history of changes for all entities in Wikidata: In the first task, we will extract the history of changes for all entities in Wikidata from the available database dumps. This task presents several interesting data engineering challenges, including large-scale data processing and efficient handling of complex file structures. For example, the May 2025 Wikidata dump comprises ~1,400 compressed XML files, each averaging around 250 MB. This task also involves parsing snapshots of entities and identifying changes between revisions. Existing work [8] has already explored how to model such changes; however, this approach needs to be adapted to the characteristics of Wikidata’s data model.
  2. Select a subset of entities: The second task involves selecting a subset of interesting entity types to work with in the following tasks.
  3. Extract basic statistics from the dataset: The third task involves calculating different statistics from the selected subset, such as the number of revisions over time, the number of entities over time, the average number of properties per entity of a certain type over time, or the number of changes per property across all entities over time. These statistics should be visualized with plots.
  4. Calculate and store data quality measures over time: The fourth task involves calculating and storing data quality measures for different dimensions, such as completeness or timeliness. These measures should be calculated incrementally to reduce storage overhead.
  5. Develop a dashboard to visualize the development of data quality measures over time: The WikiWatch dashboard should include the statistics calculated in the third task, filters to select different entity types, and visualizations for the different dimensions, showing how the measures evolve over time.
  6. Prepare a submission to a top database conference: We plan to write and submit a paper with the results of this project. The students will be the main authors of this work.
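
As a rough illustration of tasks 1 and 4, the following Python sketch streams revisions from a compressed XML file, compares the property sets of consecutive revisions, and stores a record (with a simple completeness score) only when that set changes. It is a minimal sketch under assumptions, not the planned implementation: the helper names (iter_revisions, property_ids, change_log, open_sample), the EXPECTED property set, the sample XML, and the completeness definition (share of expected properties present) are all hypothetical. It also assumes bzip2 compression and that every revision's text element holds entity JSON with a "claims" object.

```python
import bz2
import io
import json
import xml.etree.ElementTree as ET
from typing import Iterator

# Hypothetical set of properties we expect on an entity of the chosen type.
EXPECTED = frozenset({"P31", "P569", "P19"})

# Tiny hand-written sample in the style of a MediaWiki XML export (assumption).
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <page>
    <title>Q42</title>
    <revision><id>1</id><timestamp>2012-11-01T00:00:00Z</timestamp>
      <text>{"id": "Q42", "claims": {"P31": []}}</text></revision>
    <revision><id>2</id><timestamp>2013-01-01T00:00:00Z</timestamp>
      <text>{"id": "Q42", "claims": {"P31": [], "P569": []}}</text></revision>
    <revision><id>3</id><timestamp>2013-02-01T00:00:00Z</timestamp>
      <text>{"id": "Q42", "claims": {"P31": [], "P569": []}}</text></revision>
  </page>
</mediawiki>"""


def local(tag: str) -> str:
    """Strip the XML namespace: '{ns}revision' -> 'revision'."""
    return tag.rsplit("}", 1)[-1]


def iter_revisions(stream) -> Iterator[tuple[str, str, dict]]:
    """Yield (title, timestamp, entity) per revision, one element at a time."""
    title = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            fields = {local(child.tag): child.text for child in elem}
            yield title, fields["timestamp"], json.loads(fields.get("text") or "{}")
            elem.clear()  # drop the parsed revision to keep memory usage low
        elif name == "page":
            elem.clear()


def property_ids(entity: dict) -> set[str]:
    """Return the property IDs that have at least a claims entry."""
    return set(entity.get("claims") or {})


def change_log(stream, expected: frozenset[str]) -> list[dict]:
    """Store one record per revision that changed an entity's property set."""
    previous: dict[str, set[str]] = {}
    records = []
    for title, timestamp, entity in iter_revisions(stream):
        before = previous.get(title, set())
        after = property_ids(entity)
        if title not in previous or after != before:
            records.append({
                "entity": title,
                "timestamp": timestamp,
                "added": sorted(after - before),
                "removed": sorted(before - after),
                "completeness": len(after & expected) / len(expected),
            })
        previous[title] = after
    return records


def open_sample():
    """Compress the sample in memory and reopen it like a dump file."""
    return bz2.open(io.BytesIO(bz2.compress(SAMPLE.encode("utf-8"))), "rb")


if __name__ == "__main__":
    for record in change_log(open_sample(), EXPECTED):
        print(record)
```

For an actual dump file, the stream could be opened with bz2.open(path, "rb") instead of open_sample(). Pages that do not contain entity JSON would have to be filtered out first, and a real pipeline would compare statements and values rather than only property IDs.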

Initial Related Work

1. Zaveri, A., Rula, A., Maurino, A., Pietrobon, R., Lehmann, J., & Auer, S. (2016). Quality assessment for Linked Data: A Survey. Semantic Web, 7(1), 63–93. https://doi.org/10.3233/SW-150175

2. Hofer, M., Töpfer, M. M., Rost, C., & Rahm, E. (2025). DBpedia-TKG: Capturing Wikipedia’s Evolution as Temporal Knowledge Graphs. Lecture Notes in Computer Science, 15719 LNCS, 262–279. https://doi.org/10.1007/978-3-031-94578-6_15

3. Piscopo, A., & Simperl, E. (2019). What we talk about when we talk about Wikidata quality: A literature survey. Proceedings of the 15th International Symposium on Open Collaboration, OpenSym 2019. https://doi.org/10.1145/3306446.3340822 

4. Färber, M., Bartscherer, F., Menne, C., & Rettinger, A. (2018). Linked data quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Semantic Web, 9(1), 77–129. https://doi.org/10.3233/SW-170275 

 

Contact

The project will run in collaboration with Wikimedia Germany, which is responsible for Wikidata globally. It will be supervised by Dr. Lisa Ehrlinger, Carolina Cortés Lasalle, and Prof. Felix Naumann at the Information Systems chair. If you have any questions, please do not hesitate to contact us directly.

References

1. "Wikidata: Introduction." Wikidata, https://www.wikidata.org/wiki/Wikidata:Introduction. Accessed June 6, 2025.

2. Farda-Sarbas, M., Sarasua, C., Müller-Birn, C., & Bernstein, A. (2019). Workshop on Data Quality Management in Wikidata. Retrieved June 6, 2025, from https://fardamariam.wixsite.com/wikidatadqworkshop

3. Piscopo, A., & Simperl, E. (2019). What we talk about when we talk about Wikidata quality: A literature survey. Proceedings of the 15th International Symposium on Open Collaboration, OpenSym 2019. https://doi.org/10.1145/3306446.3340822

4. Balaraman, V., Razniewski, S., & Nutt, W. (2018). Recoin: Relative Completeness in Wikidata. The Web Conference 2018 - Companion of the World Wide Web Conference, WWW 2018, 1787–1792. https://doi.org/10.1145/3184558.3191641

5. Samuel, J. mlscores. GitHub, https://github.com/johnsamuelwrites/mlscores. Accessed June 10, 2025.

6. Amaral, G., Rodrigues, O., & Simperl, E. (2023). ProVe: A pipeline for automated provenance verification of knowledge graphs against textual sources. Semantic Web. https://doi.org/10.3233/SW-233467

7. Pintscher, L., & Werkmeister, L. (2019, January 18). Overview of data quality tools on Wikidata [Presentation]. Workshop on Data Quality Management in Wikidata. https://docs.google.com/presentation/d/1rwjqzPaHTsXNNqDc2Op1-qSbcFyaFwOSnkEkStp5L3E/edit?slide=id.g15105b408d_0_287

8. Bleifuß, T., Bornemann, L., Johnson, T., Kalashnikov, D. V., Naumann, F., & Srivastava, D. (2018). Exploring Change - A New Dimension of Data Analytics. PVLDB, 12(2), 85–98. https://doi.org/10.14778/3282495.3282496

9. Suchanek, F., Alam, M., Bonald, T., Chen, L., Paris, P.-H., & Soria, J. (Eds.). (2024). YAGO 4.5: A Large and Clean Knowledge Base with a Rich Taxonomy. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. arxiv.org/html/2308.11884v2

Timetable

Our meetings are currently scheduled for TBD. In our first meeting, we will discuss possible alternative times that are suitable for everyone.

The following timetable lists the main semester milestones and is still tentative.

Date       | Time          | Room   | Topic                                        | Slides
9.10.2025  | 13:00 - 14:00 | F-2.10 | Kickoff meeting                              | Slides
21.10.2025 | 15:15 - 16:45 | F-2.10 | Session about "How to read a research paper" |