Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

International Workshop on Quality in Databases (QDB) 2024

13th International Workshop on Quality in Databases at the 50th VLDB conference

August 26, 2024, Guangzhou, China

Welcome

News

  • The workshop program is online (last updated August 20, 2024).
  • Paper notifications are out.
  • The submission deadline was extended upon request to June 14. Submit your paper via CMT.
  • Quanqing Xu (Senior Researcher at OceanBase) will share his experience with data quality in an industry talk at QDB'24.
  • QDB'24 will feature an invited keynote by Sebastian Schelter (TU Berlin) on his latest research on data quality.
  • The QDB'24 workshop proposal has been accepted as a VLDB workshop.

Quality in Databases

Data quality (DQ) has been a major concern of organizations for decades. Recent advances in artificial intelligence (AI) have brought DQ back into the spotlight: while many recent data quality and cleaning solutions are powered by machine learning (ML), DQ is itself a core requirement for reliable AI-based systems. DQ is tackled from different perspectives by different research communities, including databases, ML, and information systems. We believe it is important to bring these communities together to foster a vital discussion about the future of DQ assessment and improvement.

Building on the large number of participants (>50) at QDB'23, QDB'24 aims to (1) continue hosting vital discussions about data quality and (2) exchange best practices and novel methods for (semi-)automated (ML-based) data quality assessment and improvement in the context of AI-based systems.

Program

09:00-09:15   Opening (Lisa & Hazar)

09:15-10:30   Research Session 1 (Chair: Lisa)

  Accelerating the Data Cleaning Systems Raha and Baran through Task and Data Parallelism
  Fatemeh Ahmadi, Yusuf Mandirali, Ziawasch Abedjan

  Valuation-based Data Acquisition for Machine Learning Fairness
  Ekta and Romila Pradhan

  AutoFAIR: Automatic Data FAIRification via Machine Reading
  Tingyan Ma, Wei Liu, Bin Lu, Xiaoying Gan, Yunqiang Zhu, Luoyi Fu, Chenghu Zhou

10:30-11:00   Coffee break

11:00-12:30   Keynote + 1 Research Paper (Chair: Hazar)

  Invited talk: Sebastian Schelter
  How Data Management Research Helps to Improve Real World ML Applications

  Compute Engine Testing with Privacy-Compliant Production-Like Synthetic Data
  Yu Liu, Jiangnan Cheng, Steve Chuck, Lyublena Antova, Yurgis Baykshtis, Matt David, Ge Gao, Mehrdad Honarkhah, Kuan-Sung Huang, Chen-Kuei Lee, Usman Muhammad, Shihao Peng, Andrii Rosa, Rebecca Schlussel, Michael Shang, Kelvin Silva, Brandon Vo, Zac Wen, Yihao Zhou

12:30-14:00   Lunch

14:00-15:30   Industry Session (Chair: Sourav)

  Industry talk by Quanqing Xu (OceanBase)
  Industry talk by Divesh Srivastava (AT&T Labs)

  Panel discussion with Quanqing Xu, Divesh Srivastava, and Fatma Ozcan

15:30-16:00   Coffee break

16:00-18:15   Research Session 2 (Chair: Hazar)

  Process Model-based Access Control Policies for Cross-Organizational Data Sharing
  Liam Tirpitz, Leon Gentges

  Tracking Consistency over Data Streams with InkStream [Demo]
  Samuele Langhi, Angela Bonifati, Riccardo Tommasini

  A Data Generator to Explore the Interactions Between Concept Drifts and Anomalies [Demo]
  Jongjun Park, Akanksha Nehete, Tammy Zeng, Fei Chiang

  Towards Semi-Supervised Data Quality Detection In Graphs
  Rubab Zahra Sarfraz

18:15-18:30   Closing (Chairs)

Program Committee

Program Chairs

Sourav S Bhowmick (Nanyang Technological University, Singapore)
Lisa Ehrlinger (Hasso Plattner Institute, University of Potsdam, Germany)
Hazar Harmouch (University of Amsterdam, Netherlands)

Steering Committee

Ihab Ilyas (Apple, University of Waterloo, USA)
Felix Naumann (Hasso Plattner Institute, University of Potsdam, Germany)

Program Committee

Ziawasch Abedjan (TU Berlin, Germany)
Antoon Bronselaer (Ghent University, Belgium)
Felix Biessmann (Einstein Center Digital Future, Germany)
Ismael Caballero (University of Castilla-La Mancha, Spain)
Cinzia Cappiello (Politecnico di Milano, Italy)
Chang Ge (University of Minnesota, USA)
Christine Legner (University of Lausanne, Switzerland) 
Sebastian Link (University of Auckland, New Zealand)
Elizabeth Pierce (University of Arkansas at Little Rock, USA)
Kai-Uwe Sattler (TU Ilmenau, Germany)
Sebastian Schelter (TU Berlin, Germany)
John Talburt (University of Arkansas at Little Rock, USA)
Panos Vassiliadis (University of Ioannina, Greece)
Wolfram Wöß (Johannes Kepler University Linz, Austria)

Keynote

Title: How Data Management Research Helps to Improve Real World ML Applications 

Abstract: The talk will give an overview of our past and recent research to improve data quality in ML applications, based on proven principles and techniques from data management. In particular, we will cover work on declarative data unit tests tailored for large-scale data lakes, on reasoning about the datasets for ML applications by treating ML pipelines as algebraic queries, and on leveraging fine-grained data provenance as a foundation for data debugging systems.

Sebastian Schelter is a Full Professor at the Berlin Institute for the Foundations of Learning and Data (BIFOLD) and Technische Universität Berlin. His research focuses on the intersection of data management and machine learning, with the goal of fostering the responsible management of data and democratising data science technologies. The research of his group is accompanied by efficient and scalable open source implementations, many of which are applied in real-world use cases, for example in the Amazon Web Services cloud and in large European e-commerce platforms. In the past, he has been an assistant professor at the University of Amsterdam, a faculty fellow at New York University, a senior applied scientist at Amazon Research, and a research intern at Twitter and IBM Almaden in California. His research contributions have been recognized with an ACM SIGMOD Systems Award, an ACM SIGMOD Best Demo Runner Up Award, and a Best Paper Runner Up Award from the Table Representation Learning workshop at NeurIPS.

Call for Papers

Topics of Interest

The focus is on new and practical methods for (semi-)automated (ML-based) data quality assessment and improvement. The topics of interest include, but are not limited to:

  • Data preprocessing
  • Data profiling for data quality measurement
  • Explainable data cleaning
  • DQ requirements for generative AI systems
  • DQ using generative AI
  • Data quality assessment for AI-based systems
  • Data quality improvement / data cleaning for AI-based systems
  • Benchmark data sets to evaluate DQ assurance methods
  • Automation of DQ assessment and improvement methods
  • Methods to scale data quality assessment and cleansing
  • ML-powered methods for improving data quality
  • Data quality in graph-structured or time-series data
  • Metadata management to improve data quality
  • Data quality in different data science domains
  • Human-in-the-loop approaches for DQ
  • Post-training quality / fact checking
  • FAIRness in data quality

Important Dates

Submission deadline: June 14, 2024, 9pm PST (extended from May 31, 2024, 9pm PST)
Notification: July 22, 2024
Final version: August 5, 2024
Workshop: August 26, 2024

Manuscript Preparation

Submission
Authors are invited to submit original, unpublished full research papers and demo descriptions that are not being considered for publication in any other forum.
Please submit your paper as a PDF using Microsoft's QDB CMT site. Append the category tag as a suffix to the title of the paper, for example “Data Management in the Year 3000 [Regular]” or “Spatial Database System [Demo]”. The suffix must appear both in the paper file and in the CMT submission title. It will not be part of the camera-ready copy if the paper is accepted.

Format
It is the authors' responsibility to ensure that their submissions adhere to the VLDB format detailed here. In particular, it is not allowed to modify the format with the objective of squeezing in more material. Submissions that do not comply with the formatting detailed here will be rejected without review. Note that the limit of up to 6 pages (including all figures, tables, and references) must be followed for both full papers and demos.

Publication
Accepted papers will be distributed via the CEUR workshop proceedings.

Past Events

We are building on an established tradition of eleven previous international VLDB workshops concerning data and information quality.