Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
  
 

Data Preparation Bibliography

Data Preparation Overview

Joseph M. Hellerstein, Jeffrey Heer and Sean Kandel: Self-Service Data Preparation: Research to Practice. IEEE Data Engineering Bulletin 2018

Trifacta: End User Data Preparation Market Study 2018

Gregorio Convertino, Andy Echenique: Self-Service Data Preparation and Analysis by Business Users: New Needs, Skills, and Tools. CHI 2017

Nikolaos Konstantinou, Martin Koehler, Edward Abel, Cristina Civili, Bernd Neumayr, Emanuel Sallinger, Alvaro A.A. Fernandes, Georg Gottlob, John A. Keane, Leonid Libkin, Norman W. Paton: The VADA Architecture for Cost-Effective Data Wrangling. SIGMOD 2017

Tim Furche, Georg Gottlob, Leonid Libkin, Giorgio Orsi, Norman W. Paton: Data Wrangling for Big Data: Challenges and Opportunities. EDBT 2016

Florian Endel, Harald Piringer: Data Wrangling: Making data useful again. IFAC-PapersOnLine 2015

Ignacio Terrizzano, Peter Schwarz, Mary Roth, John E. Colino: Data Wrangling: The Challenging Journey from the Wild to the Lake. CIDR 2015

Hadley Wickham: Tidy Data. Journal of Statistical Software 2014

Sean Kandel, Jeffrey Heer, Catherine Plaisant, Jessie Kennedy, Frank van Ham, Nathalie Henry Riche, Chris Weaver, Bongshin Lee, Dominique Brodbeck and Paolo Buono: Research directions in data wrangling: Visualizations and transformations for usable and credible data. Information Visualization 2011

Shichao Zhang, Chengqi Zhang, Qiang Yang: Data Preparation for Data Mining. Applied Artificial Intelligence 2003

Parsing Files

G. J. J. van den Burg, A. Nazábal, C. Sutton: Wrangling messy CSV files by detecting row and type patterns. Data Mining and Knowledge Discovery 2019

Chang Ge, Yinan Li, Eric Eilebrecht, Badrish Chandramouli, Donald Kossmann: Speculative Distributed CSV Data Parsing for Big Data Analytics. SIGMOD 2019

Till Döhmen, Hannes Mühleisen, Peter Boncz: Multi-Hypothesis CSV Parsing. SSDBM 2017

Johann Mitlöhner, Sebastian Neumaier, Jürgen Umbrich, and Axel Polleres: Characteristics of Open Data CSV Files. International Conference on Open and Big Data (OBD) 2016

 

Data Transformation

Zhongjun Jin, Michael Cafarella, H. V. Jagadish, Sean Kandel, Michael Minar, Joseph M. Hellerstein: CLX: Towards verifiable PBE data transformation. EDBT 2019

Yeye He, Xu Chu, Kris Ganjam, Yudian Zheng, Vivek Narasayya, Surajit Chaudhuri: Transform-Data-by-Example (TDE): An Extensible Search Engine for Data Transformations. VLDB 2018

Zhongjun Jin, Michael R. Anderson, Michael Cafarella, H. V. Jagadish: Foofah: Transforming Data By Example. SIGMOD 2017

Ziawasch Abedjan, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker: DataXFormer: A Robust Transformation Discovery System. ICDE 2016

John Morcos, Ziawasch Abedjan, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker: DataXFormer: An Interactive Data Transformation Tool. SIGMOD 2015

Jeffrey Heer, Joseph M. Hellerstein, Sean Kandel: Predictive Interaction for Data Transformation. CIDR 2015

William R. Harris, Sumit Gulwani: Spreadsheet Table Transformations from Examples. PLDI 2011

Sean Kandel, Andreas Paepcke, Joseph Hellerstein and Jeffrey Heer: Wrangler: Interactive Visual Specification of Data Transformation Scripts. CHI 2011

Sumit Gulwani: Automating string processing in spreadsheets using input-output examples. POPL 2011

Rishabh Singh, Sumit Gulwani: Transforming spreadsheet data types using examples. POPL 2016

Rishabh Singh: BlinkFill: Semi-supervised Programming By Example for Syntactic String Transformations. PVLDB 2016

Rishabh Singh, Sumit Gulwani: Learning Semantic String Transformations from Examples. PVLDB 2012

Multi-Table Detection

Haoyu Dong, Shijie Liu, Shi Han, Zhouyu Fu, Dongmei Zhang: TableSense: Spreadsheet Table Detection with Convolutional Neural Networks. AAAI 2019

Comment Detection

Zhe Chen, Sasha Dadiomov, Richard Wesley, Gang Xiao, Daniel Cory, Michael J. Cafarella, Jock D. Mackinlay: Spreadsheet Property Detection With Rule-assisted Active Learning. CIKM 2017

Marco D. Adelfio, Hanan Samet: Schema Extraction for Tabular Data on the Web. PVLDB 2013

Table Extraction / Understanding

Daniel W. Barowy, Sumit Gulwani, Ted Hart, Benjamin G. Zorn: FlashRelate: extracting relational data from semi-structured spreadsheets using examples. PLDI 2015

George Nagy, Sharad C. Seth, David W. Embley: End-to-End Conversion of HTML Tables for Populating a Relational Database. Document Analysis Systems 2014

Zhe Chen, Michael J. Cafarella: Automatic web spreadsheet data extraction. SSW@VLDB 2013

 

 

Preparation Suggestion

Martin Koehler, Edward Abel, Alex Bogatu, Cristina Civili, Lacramioara Mazilu, Nikolaos Konstantinou, Alvaro Fernandes, John Keane, Leonid Libkin, Norman W. Paton: Incorporating Data Context to Cost-Effectively Automate End-to-End Data Wrangling. IEEE Transactions on Big Data 2019

Norman W. Paton: Automating Data Preparation: Can We? Should We? Must We? DOLAP 2019

Besim Bilalli, Alberto Abelló, Tomàs Aluja-Banet, Robert Wrembel: Automated Data Pre-processing via Meta-learning. MEDI 2016

Jeffrey Heer, Joseph M. Hellerstein, Sean Kandel: Predictive Interaction for Data Transformation. CIDR 2015

Philip J. Guo, Sean Kandel, Joseph M. Hellerstein, Jeffrey Heer: Proactive wrangling: mixed-initiative end-user programming of data transformation scripts. UIST 2011

Katharina Morik, Martin Scholz: The MiningMart Approach to Knowledge Discovery in Databases. Intelligent Technologies for Information Analysis 2004

 

Error Detection & Cleaning

Mohammad Mahdavi, Ziawasch Abedjan, Raul Castro Fernandez, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, Nan Tang: Raha: A Configuration-Free Error Detection System. SIGMOD 2019

Alireza Heidari, Joshua McGrath, Ihab F. Ilyas, Theodoros Rekatsinas: HoloDetect: Few-Shot Learning for Error Detection. SIGMOD 2019

Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, Christopher Ré: HoloClean: Holistic Data Repairs with Probabilistic Inference. PVLDB 2017

Ziawasch Abedjan, Xu Chu, Dong Deng, Raul Castro Fernandez, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker, Nan Tang: Detecting Data Errors: Where are we and what needs to be done? PVLDB 2016

Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel Madden, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Si Yin: BigDansing: A System for Big Data Cleansing. SIGMOD 2015