Gerardo Vitagliano
Preparing data. Exploring data. Visualizing data. Machine learning data. Data-ing data. No matter what your task is, there is a fundamental requirement: the ability to load data.
Thankfully, the CSV format sets a nice standard for distributing and consuming raw data, right? Unfortunately, this is not always the case.
The flexible nature of the CSV format often leads to files that deviate from standards and best practices, for example by containing preambles, multiple tables, or unusual dialects.
We define these files as "polluted".
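To make the idea concrete, here is a minimal sketch of one polluted file and what happens when it is loaded with default settings. The file contents, the helper names `load_naive` and `load_configured`, and the choice of Python's standard `csv` module are assumptions made for illustration; they are not taken from the benchmark itself.

```python
import csv
import io

# Hypothetical polluted file: a two-line preamble followed by a
# semicolon-delimited table that uses decimal commas.
POLLUTED = (
    "Sales report\n"
    "Generated 2023-01-01\n"
    "id;name;amount\n"
    "1;Alice;10,5\n"
    "2;Bob;7,25\n"
)


def load_naive(text):
    """Parse with csv.reader defaults (comma delimiter, no rows skipped)."""
    return list(csv.reader(io.StringIO(text)))


def load_configured(text, skip_rows, delimiter):
    """Drop the first skip_rows lines, then parse with an explicit delimiter."""
    lines = text.splitlines()[skip_rows:]
    return list(csv.reader(lines, delimiter=delimiter))


naive = load_naive(POLLUTED)
clean = load_configured(POLLUTED, skip_rows=2, delimiter=";")

# The naive load keeps the preamble as data rows and splits values on the
# decimal comma, so rows have inconsistent widths.
print([len(row) for row in naive])
# The configured load recovers a rectangular three-column table.
print(clean)
```

The configured load only works because the preamble length and the delimiter were supplied by hand, which is exactly the knowledge a loader does not have in advance.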
This talk will address some of the research questions that arise with polluted files:
- What are typical, real-world CSV "pollutions" that prove challenging for data loading?
- How much "intelligence" can we expect from state-of-the-art systems when they automatically load such messy files?
- How can we measure the correctness of data loading?
To answer these questions, we present Pollock, our data loading benchmark.
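As a rough illustration of the third question, one simple way to score a loaded table is to compare its cells against a hand-made ground truth. The sketch below is a hypothetical cell-level measure, and the function name `cell_scores` is made up for this example; it should not be read as the scoring that Pollock actually uses.

```python
from collections import Counter


def cell_scores(loaded, expected):
    """Return (precision, recall, f1) over (row, column, value) cells.

    Illustrative measure only, assuming both tables are lists of rows.
    """
    def cells(table):
        return Counter(
            (r, c, value)
            for r, row in enumerate(table)
            for c, value in enumerate(row)
        )

    got, want = cells(loaded), cells(expected)
    overlap = sum((got & want).values())  # cells matching in position and value
    precision = overlap / sum(got.values()) if got else 0.0
    recall = overlap / sum(want.values()) if want else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# Hypothetical ground truth and a load that got one cell wrong.
expected = [["id", "name"], ["1", "Alice"]]
loaded = [["id", "name"], ["1", "Bob"]]

scores = cell_scores(loaded, expected)
print(scores)
```

A position-sensitive measure like this punishes a single shifted row heavily, so a real benchmark may well prefer something more forgiving.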