Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

International Workshop on Quality in Databases (QDB) 2023

12th International Workshop on Quality in Databases at the 49th VLDB conference

August 28, 2023, Vancouver, Canada

Welcome

News

  • The proceedings of the workshop are available here as a part of the Joint Proceedings of Workshops at the 49th International Conference on Very Large Data Bases (VLDB 2023)
  • The schedule of the workshop is online (Last update 23.8.2023)
  • QDB’23 will feature two invited keynotes by Renée Miller (Northeastern University) and Theodoros Rekatsinas (Apple, USA)
  • Deadline extended to June 14 due to various requests.
  • The QDB’23 workshop proposal has been accepted as a VLDB workshop.

Quality in Databases

Data quality has been a major concern of organizations for decades. The recent advances in artificial intelligence (AI) have brought data quality (DQ) back into the spotlight: while many recent data quality and cleaning solutions are powered by machine learning (ML), DQ is itself a core requirement for reliable AI-based systems. DQ is tackled from different perspectives by different research communities, including databases, ML, and information systems. We believe it is important to bring together these communities to foster a vital discussion about the future of DQ assessment and improvement.

QDB'23 revives the successful QDB workshop series to cover the needs of the AI era, addressing both industry and academia (cf. data-centric AI). The workshop aims to (1) revive vital discussions about data quality, and (2) specifically exchange novel ideas and best practices about data quality assessment and improvement in the context of AI-based systems. 

Call for Papers

Topics of Interest

The focus is on new and practical methods for (semi-)automated (ML-based) data quality assessment and improvement. The topics of interest include, but are not limited to:

  • Data quality assessment for AI-based systems
  • Data quality improvement / data cleaning for AI-based systems
  • Data preprocessing and data preparation
  • Benchmark data sets to evaluate DQ assurance methods
  • Automation of DQ assessment and improvement methods
  • ML-powered methods for improving data quality
  • Data profiling for data quality measurement
  • Data quality in graph-structured or time-series data
  • Metadata management
  • Human-in-the-loop approaches for DQ
  • Post-training quality / fact checking
  • Explainable data cleaning
  • Methods to scale data quality assessment and cleansing
  • FAIRness in data quality

Important Dates

Submission deadline: June 14, 2023, 9pm PST (extended from May 31, 2023, 9pm PST)
Notification: July 25, 2023
Final version: August 9, 2023
Workshop: August 28, 2023

Manuscript Preparation

Submission
Authors are invited to submit original, unpublished full research papers and demo descriptions that are not being considered for publication in any other forum.
Please submit your paper as a PDF using Microsoft's QDB CMT site. Append the category tag as a suffix to the title of the paper, for example “Data Management in the Year 3000 [Regular]” or “Spatial Database System [Demo]”. This must be done both in the paper file and in the CMT submission title. The suffix will not be part of the camera-ready copy if the paper is accepted.

Format
It is the authors' responsibility to ensure that their submissions adhere to the VLDB format detailed here. In particular, it is not allowed to modify the format with the objective of squeezing in more material. Submissions that do not comply with the formatting detailed here will be rejected without review. Note that the limit of up to 6 pages (including all figures, tables, and references) must be followed for both full papers and demos.

Publication
Accepted papers will be distributed via the CEUR workshop proceedings.

Program

08:45 - 09:00  Introduction (Organizers)
09:00 - 10:00  Invited talk 1: Renée J. Miller (Bio), Chair: Felix
  • Semantic Version Management (abstract)
10:00 - 10:30  Coffee break
10:30 - 12:00  Research Session 1, Chair: Hazar
  • Data Quality in Data Streams by Modular Change Point Detection
    Yaron Kanza, Rajat Malik, Divesh Srivastava, Caroline M Stone and Gordon Woodhull (pdf, slides)
  • On the Data Quality of Remotely Sensed Forest Maps
    Shadi Ghasemitaheri, Amelia Holcomb, Lukasz Golab and Srinivasan Keshav (pdf, slides)
  • About the Effects of Data Imputation Techniques on ML Uncertainty
    Cinzia Cappiello, Federico Cerutti, Camilla Sancricca and Riccardo Zanelli (pdf, slides)
  • Data Quality and Data Ethics: Towards a Trade-off Evaluation
    Fabio Azzalini, Cinzia Cappiello, Chiara Criscuolo, Camilla Sancricca and Letizia Tanca (pdf, slides)
12:00 - 13:30  Lunch
13:30 - 14:30  Invited talk 2: Theodoros Rekatsinas (Bio), Chair: Ihab
  • Is Data Management the Key to Successful AI Systems? (abstract)
14:30 - 15:00  Research Session 2, Chair: Ihab
  • A Plaque-Test for Redundancies in Relational Data
    Christoph Köhnen, Stefan Klessinger, Jens Zumbrägel and Stefanie Scherzinger (pdf, slides)
15:00 - 15:30  Coffee break
15:30 - 17:00  World Café Session (details), Chair: Lisa

Program Committee

Program Chairs

Lisa Ehrlinger (Software Competence Center Hagenberg GmbH, Austria)
Hazar Harmouch (Hasso Plattner Institute, University of Potsdam, Germany)
Ihab Ilyas (Apple, University of Waterloo, USA)
Felix Naumann (Hasso Plattner Institute, University of Potsdam, Germany)

Program Committee

Ziawasch Abedjan (Leibniz Universität Hannover, Germany)
Felix Bießmann (Berlin University of Applied Sciences and Technology, Germany)
Antoon Bronselaer (Ghent University, Belgium)
Ismael Caballero (University of Castilla La Mancha, Spain)
Cinzia Cappiello (Politecnico di Milano, Italy)
Chang Ge (U Minnesota, USA)
Rihan Hai (TU Delft, Netherlands)
Christine Legner (University of Lausanne, Switzerland) 
Sebastian Link (University of Auckland, New Zealand)
Paolo Papotti (EURECOM, France)
Elizabeth Pierce (University of Little Rock, USA)
Erhard Rahm (Uni Leipzig, Germany)
Shazia Sadiq (The University of Queensland, Australia)
Kai-Uwe Sattler (TU Ilmenau, Germany)
Sebastian Schelter (University of Amsterdam, Netherlands)
Panos Vassiliadis (University of Ioannina, Greece)
John Talburt (University of Little Rock, USA)
Wolfram Wöß (Johannes Kepler University Linz, Austria)

Abstracts

Semantic Version Management
Renée Miller (Northeastern University)

Data science is by its nature collaborative and, as a result, multiple versions of the same dataset are generated as a by-product of most data science activities. While managing and storing data versions has received some attention in the research literature, the semantic nature of such changes has remained under-explored. In this talk, I will present our new project on Semantic Version Management in which we seek to lay the foundations for semantic understanding of data changes that result in multiple versions and introduce scalable tools to uncover and explain data changes. Specifically, I will briefly introduce Explain-Da-V [1], a framework that explains changes between two given dataset versions using data transformations. I will then present open problems and challenges that include efficiently finding new tables that help explain a set of changes and discovering different versions of a dataset from within a massive table repository.

This is joint work with Professor Roee Shraga of the Worcester Polytechnic Institute.

[1] Roee Shraga, Renée J. Miller:  Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V. Proc. VLDB Endow. 16(6): 1587-1600 (2023)

 

Is Data Management the Key to Successful AI Systems?
Theodoros Rekatsinas (Apple, USA)

Reasoning over relational data presents unique challenges and opportunities in the context of modern AI. This talk will explore how vector spaces offer a promising solution to data quality and data analytics problems by providing a unified representation for relational data. We will briefly review machine-learning models for embedding relational data in vector spaces and then dive deeper into systems, algorithms, and architectural designs for scaling these machine-learning approaches to billion-scale data inputs. In this second part of the talk, I will introduce Marius (https://marius-project.org), a system for scaling modern AI models to billion-scale graphs and discuss novel learned indexing methods for enabling high-throughput inference over embedded relational data. I will conclude the talk by discussing how the above techniques have been applied in an industry setting.

Bios

Renée J. Miller is a University Distinguished Professor of Computer Science at Northeastern University. She is a Fellow of the Royal Society of Canada, Canada’s National Academy of Science, Engineering and the Humanities. She received the US Presidential Early Career Award for Scientists and Engineers (PECASE), the highest honor bestowed by the United States government on outstanding scientists and engineers beginning their careers. She received an NSF CAREER Award, the Ontario Premier’s Research Excellence Award, and an IBM Faculty Award. She formerly held the Bell Canada Chair of Information Systems at the University of Toronto and is a fellow of the ACM. Her work has focused on the long-standing open problem of data integration and has achieved the goal of building practical data integration systems. She and her colleagues received the ICDT Test-of-Time Award and the 2020 Alonzo Church Award for Outstanding Contributions to Logic and Computation for their influential work establishing the foundations of data exchange. For her body of work, she has received the CS Canada Lifetime Achievement Award in Computer Science. Professor Miller is an Editor-in-Chief of the VLDB Journal and former president of the non-profit Very Large Data Base (VLDB) Foundation. She received her PhD in Computer Science from the University of Wisconsin, Madison and bachelor’s degrees in Mathematics and Cognitive Science from MIT.

Theo Rekatsinas is a Research Engineer at Apple. Theo co-founded Inductiv (acquired by Apple), a company that developed AI solutions for identifying and correcting errors in data. Theo was also a Professor of Computer Science at ETH Zürich and the University of Wisconsin-Madison. Theo’s research focuses on scalable machine learning algorithms and systems over relational data. His research explores the fundamental connections between data preparation, data integration, and knowledge management with statistical machine learning and probabilistic inference. Theo holds PhD and Master’s degrees in Computer Science from the University of Maryland - College Park. He also holds a Bachelor’s and Master’s degree in Electrical Engineering from the National Technical University of Greece.

World Café Session

 Interactive Discussion on the Future of Data Quality

The aim of this session is to discuss the future of data quality research based on current challenges. We will use the world-café method for moderation, where 3-4 specific areas (e.g., benchmark datasets) are discussed at separate tables to establish the key challenges of that area. Each table will have a moderator. At the end of the session, we will summarize and present the outcome of each table and discuss key take-aways.

Past Events

We are building on an established tradition of eleven previous international VLDB workshops concerning data and information quality.