Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Data Preparation Bibliography

Data Preparation Overview

Joseph M. Hellerstein, Jeffrey Heer and Sean Kandel: Self-Service Data Preparation: Research to Practice IEEE Bulletin on Data Engineering 2018

Trifacta: End User Data Preparation Market Study 2018

Gregorio Convertino, Andy Echenique: Self-Service Data Preparation and Analysis by Business Users: New Needs, Skills, and Tools CHI 2017

Nikolaos Konstantinou, Martin Koehler, Edward Abel, Cristina Civili, Bernd Neumayr, Emanuel Sallinger, Alvaro A.A. Fernandes, Georg Gottlob, John A. Keane, Leonid Libkin, Norman W. Paton: The VADA Architecture for Cost-Effective Data Wrangling SIGMOD 2017

Tim Furche, Georg Gottlob, Leonid Libkin, Giorgio Orsi, Norman W. Paton: Data Wrangling for Big Data: Challenges and Opportunities EDBT 2016

Florian Endel, Harald Piringer: Data Wrangling: Making data useful again IFAC-PapersOnLine 2015

Ignacio Terrizzano, Peter Schwarz, Mary Roth, John E. Colino: Data Wrangling: The Challenging Journey from the Wild to the Lake CIDR 2015 

Hadley Wickham: Tidy Data Journal of Statistical Software 2014

Sean Kandel, Jeffrey Heer, Catherine Plaisant, Jessie Kennedy, Frank van Ham, Nathalie Henry Riche, Chris Weaver, Bongshin Lee, Dominique Brodbeck and Paolo Buono: Research directions in data wrangling: Visualizations and transformations for usable and credible data Information Visualization 2011

Shichao Zhang, Chengqi Zhang, Qiang Yang: Data Preparation for Data Mining Applied Artificial Intelligence 2003

Parsing Files

Yu Sun, Shaoxu Song, Chen Wang, Jianmin Wang:  Swapping Repair for Misplaced Attribute Values ICDE 2020

GJJ van den Burg, A. Nazábal, C. Sutton:  Wrangling messy CSV files by detecting row and type patterns  Data Mining and Knowledge Discovery 2019

Chang Ge, Yinan Li, Eric Eilebrecht, Badrish Chandramouli, Donald Kossmann:  Speculative Distributed CSV Data Parsing for Big Data Analytics  SIGMOD 2019

Till Döhmen, Hannes Mühleisen, Peter Boncz:  Multi-Hypothesis CSV Parsing  SSDBM 2017

Johann Mitlöhner, Sebastian Neumaier, Jürgen Umbrich, and Axel Polleres:  Characteristics of Open Data CSV Files  International Conference on Open and Big Data (OBD) 2016

Shitesh Saurav, Peter Schwarz:  A Machine-Learning Approach to Automatic Detection of Delimiters in Tabular Data Files  (HPCC/SmartCity/DSS) 2016

 

Data Transformation

Zhongjun Jin, Yeye He, Surajit Chaudhuri: Auto-Transform: Learning-to-Transform by Patterns VLDB 2020

Zhongjun Jin, Michael Cafarella, H. V. Jagadish, Sean Kandel, Michael Minar, Joseph M. Hellerstein: CLX: Towards verifiable PBE data transformation EDBT 2019

Yeye He, Xu Chu, Kris Ganjam, Yudian Zheng, Vivek Narasayya, Surajit Chaudhuri: Transform-Data-by-Example (TDE): An Extensible Search Engine for Data Transformations VLDB 2018

Zhongjun Jin, Michael R. Anderson, Michael Cafarella, H. V. Jagadish: Foofah: Transforming Data By Example SIGMOD 2017 

Ziawasch Abedjan, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker: DataXFormer: A Robust Transformation Discovery System ICDE 2016

Rishabh Singh, Sumit Gulwani: Transforming spreadsheet data types using examples. POPL 2016

Rishabh Singh: BlinkFill: Semi-supervised Programming By Example for Syntactic String Transformations. PVLDB 2016

John Morcos, Ziawasch Abedjan, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker: DataXFormer: An Interactive Data Transformation Tool SIGMOD 2015

Jeffrey Heer, Joseph M. Hellerstein, Sean Kandel: Predictive Interaction for Data Transformation CIDR 2015

Rishabh Singh, Sumit Gulwani: Learning Semantic String Transformations from Examples. PVLDB 2012

William R. Harris, Sumit Gulwani: Spreadsheet Table Transformations from Examples PLDI 2011

Sean Kandel, Andreas Paepcke, Joseph Hellerstein and Jeffrey Heer: Wrangler: Interactive Visual Specification of Data Transformation Scripts CHI 2011

Sumit Gulwani: Automating string processing in spreadsheets using input-output examples. POPL 2011

Multi-Table Detection

Haoyu Dong, Shijie Liu, Shi Han, Zhouyu Fu, Dongmei Zhang: TableSense: Spreadsheet Table Detection with Convolutional Neural Networks AAAI 2019

Koci, Elvis, Maik Thiele, Oscar Romero, and Wolfgang Lehner: A Genetic-Based Search for Adaptive Table Recognition in Spreadsheets IAPR 2019

Ghasemi Gol, Majid, Jay Pujara, and Pedro Szekely: Tabular Cell Classification Using Pre-Trained Cell Embeddings ICDM 2019

Zanibbi, Richard, Dorothea Blostein, and James R. Cordy: A Survey of Table Recognition: Models, Observations, Transformations, and Inferences IJDAR 2004

Comment Detection

Zhe Chen, Sasha Dadiomov, Richard Wesley, Gang Xiao, Daniel Cory, Michael J. Cafarella, Jock D. Mackinlay: Spreadsheet Property Detection With Rule-assisted Active Learning. CIKM 2017

Marco D. Adelfio, Hanan Samet: Schema Extraction for Tabular Data on the Web. PVLDB 2013

Table Extraction / Understanding

Daniel W. Barowy, Sumit Gulwani, Ted Hart, Benjamin G. Zorn: FlashRelate: extracting relational data from semi-structured spreadsheets using examples. PLDI 2015

George Nagy, Sharad C. Seth, David W. Embley: End-to-End Conversion of HTML Tables for Populating a Relational Database. Document Analysis Systems 2014

Zhe Chen, Michael J. Cafarella: Automatic web spreadsheet data extraction. SSW@VLDB 2013

Christodoulakis, Christina, Eric B. Munson, Moshe Gabel, Angela Demke Brown, and Renée J. Miller: Pytheas: Pattern-based Table Discovery in CSV Files. PVLDB 2020

Preparation Suggestion

Yan, Cong, and Yeye He. Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks. SIGMOD 2020

Koehler, Martin, Edward Abel, Alex Bogatu, Cristina Civili, Lacramioara Mazilu, Nikolaos Konstantinou, Alvaro Fernandes, John Keane, Leonid Libkin, and Norman W. Paton. Incorporating Data Context to Cost-Effectively Automate End-to-End Data Wrangling. IEEE Transactions on Big Data 2019

Norman W. Paton: Automating Data Preparation: Can We? Should We? Must We? DOLAP 2019

Besim Bilalli, Alberto Abelló, Tomàs Aluja-Banet, Robert Wrembel: Automated Data Pre-processing via Meta-learning. MEDI 2016

Jeffrey Heer, Joseph M. Hellerstein, Sean Kandel: Predictive Interaction for Data Transformation. CIDR 2015

Philip J. Guo, Sean Kandel, Joseph M. Hellerstein, Jeffrey Heer: Proactive wrangling: mixed-initiative end-user programming of data transformation scripts. UIST 2011

Morik, Katharina, and Martin Scholz. The MiningMart approach to knowledge discovery in databases. Intelligent technologies for information analysis 2004

Error Detection & Cleaning

Mohammad Mahdavi, Ziawasch Abedjan, Raul Castro Fernandez, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, Nan Tang: Raha: A Configuration-Free Error Detection System. SIGMOD 2019

Alireza Heidari, Joshua McGrath, Ihab F. Ilyas, Theodoros Rekatsinas: HoloDetect: Few-Shot Learning for Error Detection. SIGMOD 2019

Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, Christopher Ré: HoloClean: Holistic Data Repairs with Probabilistic Inference. PVLDB 2017

Ziawasch Abedjan, Xu Chu, Dong Deng, Raul Castro Fernandez, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker, Nan Tang: Detecting Data Errors: Where are we and what needs to be done? PVLDB 2016

Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel Madden, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Si Yin: BigDansing: A System for Big Data Cleansing. SIGMOD 2015

 

Datasets

Koci, Elvis, Maik Thiele, Josephine Rehak, Oscar Romero, and Wolfgang Lehner: DECO: A Dataset of Annotated Spreadsheets for Layout and Table Recognition ICDAR 2019

Hermans, Felienne, and Emerson Murphy-Hill: Enron’s Spreadsheets and Related Emails: A Dataset and Analysis ICSE 2015

Barik, Titus, Kevin Lubick, Justin Smith, John Slankas, and Emerson Murphy-Hill: Fuse: A Reproducible, Extendable, Internet-Scale Corpus of Spreadsheets MSR 2015

Fisher, Marc, and Gregg Rothermel: The EUSES spreadsheet corpus: a shared resource for supporting experimentation with spreadsheet dependability mechanisms WEUSE 2005