Data Preparation Bibliography
Data Preparation Overview
Joseph M. Hellerstein, Jeffrey Heer and Sean Kandel: Self-Service Data Preparation: Research to Practice IEEE Bulletin on Data Engineering 2018
Trifacta: End User Data Preparation Market Study 2018
Gregorio Convertino, Andy Echenique: Self-Service Data Preparation and Analysis by Business Users: New Needs, Skills, and Tools CHI 2017
Nikolaos Konstantinou, Martin Koehler, Edward Abel, Cristina Civili, Bernd Neumayr, Emanuel Sallinger, Alvaro A.A. Fernandes, Georg Gottlob, John A. Keane, Leonid Libkin, Norman W. Paton: The VADA Architecture for Cost-Effective Data Wrangling SIGMOD 2017
Tim Furche, Georg Gottlob, Leonid Libkin, Giorgio Orsi, Norman W. Paton: Data Wrangling for Big Data: Challenges and Opportunities EDBT 2016
Florian Endel, Harald Piringer: Data Wrangling: Making data useful again IFAC-PapersOnLine 2015
Ignacio Terrizzano, Peter Schwarz, Mary Roth, John E. Colino: Data Wrangling: The Challenging Journey from the Wild to the Lake CIDR 2015
Hadley Wickham: Tidy Data Journal of Statistical Software 2014
Sean Kandel, Jeffrey Heer, Catherine Plaisant, Jessie Kennedy, Frank van Ham, Nathalie Henry Riche, Chris Weaver, Bongshin Lee, Dominique Brodbeck and Paolo Buono: Research directions in data wrangling: Visualizations and transformations for usable and credible data Information Visualization 2011
Shichao Zhang, Chengqi Zhang, Qiang Yang: Data Preparation for Data Mining Applied Artificial Intelligence 2003
Parsing Files
Yu Sun, Shaoxu Song, Chen Wang, Jianmin Wang: Swapping Repair for Misplaced Attribute Values ICDE 2020
GJJ van den Burg, A. Nazábal, C. Sutton: Wrangling messy CSV files by detecting row and type patterns Data Mining and Knowledge Discovery 2019
Chang Ge, Yinan Li, Eric Eilebrecht, Badrish Chandramouli, Donald Kossmann: Speculative Distributed CSV Data Parsing for Big Data Analytics SIGMOD 2019
Till Döhmen, Hannes Mühleisen, Peter Boncz: Multi-Hypothesis CSV Parsing SSDBM 2017
Johann Mitlöhner, Sebastian Neumaier, Jürgen Umbrich, and Axel Polleres: Characteristics of Open Data CSV Files International Conference on Open and Big Data (OBD) 2016
Shitesh Saurav, Peter Schwarz: A Machine-Learning Approach to Automatic Detection of Delimiters in Tabular Data Files (HPCC/SmartCity/DSS) 2016
Data Transformation
Zhongjun Jin, Yeye He, Surajit Chaudhuri: Auto-Transform: Learning-to-Transform by Patterns VLDB 2020
Zhongjun Jin, Michael Cafarella, H. V. Jagadish, Sean Kandel, Michael Minar, Joseph M. Hellerstein: CLX: Towards verifiable PBE data transformation EDBT 2019
Yeye He, Xu Chu, Kris Ganjam, Yudian Zheng, Vivek Narasayya, Surajit Chaudhuri: Transform-Data-by-Example (TDE): An Extensible Search Engine for Data Transformations VLDB 2018
Zhongjun Jin, Michael R. Anderson, Michael Cafarella, H. V. Jagadish: Foofah: Transforming Data By Example SIGMOD 2017
Ziawasch Abedjan, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker: DataXFormer: A Robust Transformation Discovery System ICDE 2016
Rishabh Singh, Sumit Gulwani: Transforming spreadsheet data types using examples. POPL 2016
Rishabh Singh: BlinkFill: Semi-supervised Programming By Example for Syntactic String Transformations. PVLDB 2016
John Morcos, Ziawasch Abedjan, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker: DataXFormer: An Interactive Data Transformation Tool SIGMOD 2015
Jeffrey Heer, Joseph M. Hellerstein, Sean Kandel: Predictive Interaction for Data Transformation CIDR 2015
Rishabh Singh, Sumit Gulwani: Learning Semantic String Transformations from Examples. PVLDB 2012
William R. Harris, Sumit Gulwani: Spreadsheet Table Transformations from Examples PLDI 2011
Sean Kandel, Andreas Paepcke, Joseph Hellerstein and Jeffrey Heer: Wrangler: Interactive Visual Specification of Data Transformation Scripts CHI 2011
Sumit Gulwani: Automating string processing in spreadsheets using input-output examples. POPL 2011
Multi-Table Detection
Haoyu Dong, Shijie Liu, Shi Han, Zhouyu Fu, Dongmei Zhang: TableSense: Spreadsheet Table Detection with Convolutional Neural Networks AAAI 2019
Koci, Elvis, Maik Thiele, Oscar Romero, and Wolfgang Lehner: A Genetic-Based Search for Adaptive Table Recognition in Spreadsheets IAPR 2019
Ghasemi Gol, Majid, Jay Pujara, and Pedro Szekely: Tabular Cell Classification Using Pre-Trained Cell Embeddings ICDM 2019
Zanibbi, Richard, Dorothea Blostein, and James R. Cordy: A Survey of Table Recognition: Models, Observations, Transformations, and Inferences IJDAR 2004
Comment Detection
Zhe Chen, Sasha Dadiomov, Richard Wesley, Gang Xiao, Daniel Cory, Michael J. Cafarella, Jock D. Mackinlay: Spreadsheet Property Detection With Rule-assisted Active Learning. CIKM 2017
Marco D. Adelfio, Hanan Samet: Schema Extraction for Tabular Data on the Web. PVLDB 2013
Table Extraction / Understanding
Daniel W. Barowy, Sumit Gulwani, Ted Hart, Benjamin G. Zorn: FlashRelate: extracting relational data from semi-structured spreadsheets using examples. PLDI 2015
George Nagy, Sharad C. Seth, David W. Embley: End-to-End Conversion of HTML Tables for Populating a Relational Database. Document Analysis Systems 2014
Zhe Chen, Michael J. Cafarella: Automatic web spreadsheet data extraction. SSW@VLDB 2013
Christodoulakis, Christina, Eric B. Munson, Moshe Gabel, Angela Demke Brown, and Renée J. Miller: Pytheas: Pattern-based Table Discovery in CSV Files. PVLDB 2020.
Preparation Suggestion
Yan, Cong, and Yeye He. Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks. SIGMOD 2020
Koehler, Martin, Edward Abel, Alex Bogatu, Cristina Civili, Lacramioara Mazilu, Nikolaos Konstantinou, Alvaro Fernandes, John Keane, Leonid Libkin, and Norman W. Paton. Incorporating Data Context to Cost-Effectively Automate End-to-End Data Wrangling. IEEE Transactions on Big Data 2019
Norman W. Paton: Automating Data Preparation: Can We? Should We? Must We? DOLAP 2019
Besim Bilalli, Alberto Abelló, Tomàs Aluja-Banet, Robert Wrembel: Automated Data Pre-processing via Meta-learning. MEDI 2016
Jeffrey Heer, Joseph M. Hellerstein, Sean Kandel: Predictive Interaction for Data Transformation. CIDR 2015
Philip J. Guo, Sean Kandel, Joseph M. Hellerstein, Jeffrey Heer: Proactive wrangling: mixed-initiative end-user programming of data transformation scripts. UIST 2011
Morik, Katharina, and Martin Scholz. The miningmart approach to knowledge discovery in databases. Intelligent technologies for information analysis 2004.
Error Detection & Cleaning
Mohammad Mahdavi, Ziawasch Abedjan, Raul Castro Fernandez, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, Nan Tang: Raha: A Configuration-Free Error Detection System. SIGMOD 2019
Alireza Heidari, Joshua McGrath, Ihab F. Ilyas, Theodoros Rekatsinas: HoloDetect: Few-Shot Learning for Error Detection. SIGMOD 2019
Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, Christopher Ré: HoloClean: Holistic Data Repairs with Probabilistic Inference. PVLDB 2017
Ziawasch Abedjan, Xu Chu, Dong Deng, Raul Castro Fernandez, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker, Nan Tang: Detecting Data Errors: Where are we and what needs to be done? PVLDB 2016
Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel Madden, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Si Yin: BigDansing: A System for Big Data Cleansing. SIGMOD 2015
Datasets
Koci, Elvis, Maik Thiele, Josephine Rehak, Oscar Romero, and Wolfgang Lehner: DECO: A Dataset of Annotated Spreadsheets for Layout and Table Recognition ICDAR 2019
Hermans, Felienne, and Emerson Murphy-Hill: Enron’s Spreadsheets and Related Emails: A Dataset and Analysis ICSE 2015
Barik, Titus, Kevin Lubick, Justin Smith, John Slankas, and Emerson Murphy-Hill: Fuse: A Reproducible, Extendable, Internet-Scale Corpus of Spreadsheets MSR 2015
Fisher, Marc, and Gregg Rothermel: The EUSES spreadsheet corpus: a shared resource for supporting experimentation with spreadsheet dependability mechanisms WEUSE 2005