Semantic Version Management
Renée Miller (Northeastern University)
Data science is by its nature collaborative and, as a result, multiple versions of the same dataset are generated as a by-product of most data science activities. While managing and storing data versions has received some attention in the research literature, the semantic nature of such changes has remained under-explored. In this talk, I will present our new project on Semantic Version Management, in which we seek to lay the foundations for a semantic understanding of the data changes that produce multiple versions and to introduce scalable tools that uncover and explain those changes. Specifically, I will briefly introduce Explain-Da-V [1], a framework that explains changes between two given dataset versions using data transformations. I will then present open problems and challenges, including efficiently finding new tables that help explain a set of changes and discovering different versions of a dataset within a massive table repository.
This is joint work with Professor Roee Shraga of the Worcester Polytechnic Institute.
[1] Roee Shraga, Renée J. Miller: Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V. Proc. VLDB Endow. 16(6): 1587-1600 (2023)
Is Data Management the Key to Successful AI Systems?
Theodoros Rekatsinas (Apple, USA)
Reasoning over relational data presents unique challenges and opportunities in the context of modern AI. This talk will explore how vector spaces offer a promising solution to data quality and data analytics problems by providing a unified representation for relational data. In the first part of the talk, we will briefly review machine-learning models for embedding relational data in vector spaces and then dive deeper into systems, algorithms, and architectural designs for scaling these machine-learning approaches to billion-scale data inputs. In the second part, I will introduce Marius (https://marius-project.org), a system for scaling modern AI models to billion-scale graphs, and discuss novel learned indexing methods that enable high-throughput inference over embedded relational data. I will conclude by discussing how these techniques have been applied in an industry setting.