Our paper "Addressing Data Management Challenges for Interoperable Data Science" by Ilin Tolovski and Tilmann Rabl was accepted at DATAI '24 (co-located with VLDB).
Abstract:
The development of data science pipelines (DSPs) has been steadily growing in popularity. While the increasing number of applications can also be attributed to novel algorithms and analytics libraries, the interoperability of new DSPs has been limited. To investigate this, we curated a corpus of over 494k GitHub Python repositories. We find that only 20% of the data science pipelines provide access to their input data and only 14% use a data backend. These findings highlight the key pain points in the development of interoperable DSPs.
We identify five open data management challenges related to pipeline analysis, data access, and storage. We introduce Stork, a system for automated pipeline analysis, transformation, and data migration. Stork provides open data access while removing the human in the loop when reproducing results and migrating projects to different storage and execution environments. We analyze terabytes of DSPs with Stork and successfully process 72% of the pipelines, transforming 75% of the accessible datasets.
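To give a rough idea of what automated pipeline analysis can involve, the following minimal sketch statically scans a Python script for calls that read data from a literal path. It is an illustration only and not Stork's implementation: the `READ_CALLS` set, the `find_data_reads` helper, and the sample pipeline are all assumptions made for this example.

```python
import ast

# Hypothetical set of reader function names; chosen for illustration, not taken from Stork.
READ_CALLS = {"read_csv", "read_json", "read_parquet", "open"}


def find_data_reads(source: str) -> list[tuple[str, str]]:
    """Return (function name, path) pairs for read calls whose first
    positional argument is a string literal."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call) or not node.args:
            continue
        func = node.func
        # Handle both `pd.read_csv(...)` (Attribute) and `open(...)` (Name).
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        first = node.args[0]
        if name in READ_CALLS and isinstance(first, ast.Constant) and isinstance(first.value, str):
            found.append((name, first.value))
    return found


# A toy pipeline: two reads use literal paths, one uses a variable.
pipeline = '''
import pandas as pd
df = pd.read_csv("data/train.csv")
with open("labels.txt") as f:
    labels = f.read()
model_input = pd.read_csv(path_from_config)
'''

reads = find_data_reads(pipeline)
for name, path in reads:
    print(f"{name}: {path}")
```

The third read in the toy pipeline is skipped because its path is a variable rather than a literal. Resolving such inputs requires more than a syntactic scan, which hints at why data access in pipelines is hard to analyze automatically.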