Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Registration

To register for this seminar, please register with the Studienreferat and send an e-mail to gerardo.vitagliano(at)hpi.de with the subject: "Registration to Table Recognition".

Registered students will receive a Zoom link by e-mail to attend the seminar sessions.

Description

Structured files such as spreadsheets are valuable sources of data, but they are often ill-suited for machine consumption. Although spreadsheets arrange cells in a grid, the data they contain often follows a free layout with no clearly defined tabular structure. Worse, tables may be spread across several independent regions that end users interested in their content must ultimately recognize and merge. For automated data preparation, extraction, or integration, there is therefore great value in recognizing the presence and layout of regions, especially tables, within a spreadsheet.

Table recognition is a well-known problem that researchers have tackled in various domains and under different assumptions. In this seminar, we will introduce you to the research area of table recognition in spreadsheet files. Each team, ideally consisting of two students, will explore, implement, and potentially improve on a different solution to detect and extract tables from spreadsheet files.

We will provide you with state-of-the-art papers that propose solutions to this problem. You will implement these solutions and try to improve on them with your own ideas in a scalable way. We will also provide thousands of files for testing and evaluation.
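To make the task concrete, here is a minimal sketch of a naive baseline: it treats every group of adjacent non-empty cells as one region and reports its bounding box. This is not the method of any paper in the seminar's literature list. The function names (`find_regions`, `regions_from_csv_text`), the 8-neighbour adjacency rule, and the toy input are assumptions made for illustration only.

```python
import csv
import io
from collections import deque


def find_regions(grid):
    """Return bounding boxes (top, left, bottom, right) of connected non-empty cells.

    `grid` is a list of rows, each row a list of cell strings.
    Two cells belong to the same region if they touch, including diagonally.
    """
    n_rows = len(grid)
    seen = set()
    regions = []
    for r in range(n_rows):
        for c in range(len(grid[r])):
            if (r, c) in seen or not grid[r][c].strip():
                continue
            # Breadth-first search over the 8-connected non-empty neighbours.
            queue = deque([(r, c)])
            seen.add((r, c))
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (
                            0 <= ny < n_rows
                            and 0 <= nx < len(grid[ny])  # rows may be ragged
                            and (ny, nx) not in seen
                            and grid[ny][nx].strip()
                        ):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
            regions.append((top, left, bottom, right))
    return regions


def regions_from_csv_text(text):
    """Parse CSV text into a cell grid and detect its regions."""
    return find_regions(list(csv.reader(io.StringIO(text))))


# Made-up toy sheet: two small tables separated by an empty row and empty columns.
SAMPLE = (
    "id,value,,,\n"
    "1,a,,,\n"
    "2,b,,,\n"
    ",,,,\n"
    ",,,key,count\n"
    ",,,x,10\n"
)

if __name__ == "__main__":
    print(regions_from_csv_text(SAMPLE))
```

On the toy sheet, the sketch reports two regions, one per table. It breaks down as soon as a table contains an empty row or column, or when two tables touch, which is exactly where more elaborate approaches are needed.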

Timetable

When: Wednesdays 15:15 at Campus II, Building F, Room F-2-11 / online on Zoom. The following timetable is still tentative.

Date                Topic                                                    Slides
14.4.2021           Introduction
21.4.2021           Group allocation + Research Framework
28.4.2021           Basics of literature search and giving technical talks
5.5.2021            Progress report
12.5.2021           Progress report
19.5.2021           Technical talk about specific paper (baseline solution)
26.5.2021           Research Process – practical hints and clues
2.6.2021            Mid-term presentation (implementation)
9.6.2021-23.6.2021  Progress report
7.7.2021            Presentation of improvements
7.7.2021-14.7.2021  Progress report
21.7.2021           End-term presentation
End of September    Final submission

Literature

You can find the following papers on DBLP or Google Scholar:

  • Barik, T., Lubick, K., Smith, J., Slankas, J. and Murphy-Hill, E. 2015. Fuse: A Reproducible, Extendable, Internet-Scale Corpus of Spreadsheets. IEEE/ACM Working Conference on Mining Software Repositories.

  • Christodoulakis, C., Munson, E.B. and Gabel, M. 2020. Pytheas: Pattern-based Table Discovery in CSV Files. PVLDB.

  • Dong, H., Liu, S., Han, S., Fu, Z. and Zhang, D. 2019. TableSense: Spreadsheet Table Detection with Convolutional Neural Networks. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).

  • Gilani, A., Qasim, S.R., Malik, I. and Shafait, F. 2017. Table Detection Using Deep Learning. IAPR International Conference on Document Analysis and Recognition (ICDAR).

  • Hermans, F. and Murphy-Hill, E. 2015. Enron’s Spreadsheets and Related Emails: A Dataset and Analysis. IEEE/ACM International Conference on Software Engineering (ICSE).

  • Fisher, M. and Rothermel, G. 2005. The EUSES spreadsheet corpus: a shared resource for supporting experimentation with spreadsheet dependability mechanisms. ACM SIGSOFT Software Engineering Notes.

  • Koci, E., Thiele, M., Romero, O. and Lehner, W. 2017. Table Identification and Reconstruction in Spreadsheets. International Conference on Document Analysis and Recognition (ICDAR).

  • Koci, E., Thiele, M., Rehak, J., Romero, O. and Lehner, W. 2019. DECO: A Dataset of Annotated Spreadsheets for Layout and Table Recognition. International Conference on Document Analysis and Recognition (ICDAR).

  • Koci, E., Thiele, M., Romero, O. and Lehner, W. 2019. A Genetic-Based Search for Adaptive Table Recognition in Spreadsheets. International Conference on Document Analysis and Recognition (ICDAR).

  • Mitlohner, J., Neumaier, S., Umbrich, J. and Polleres, A. 2016. Characteristics of Open Data CSV Files. International Conference on Open and Big Data (OBD).

By Topic:

Datasets:

  • Hermans, F. and Murphy-Hill, E. 2015. Enron’s Spreadsheets and Related Emails: A Dataset and Analysis. IEEE/ACM International Conference on Software Engineering (ICSE).

  • Barik, T., Lubick, K., Smith, J., Slankas, J. and Murphy-Hill, E. 2015. Fuse: A Reproducible, Extendable, Internet-Scale Corpus of Spreadsheets. IEEE/ACM Working Conference on Mining Software Repositories.

  • Fisher, M. and Rothermel, G. 2005. The EUSES spreadsheet corpus: a shared resource for supporting experimentation with spreadsheet dependability mechanisms. ACM SIGSOFT Software Engineering Notes.

  • Koci, E., Thiele, M., Rehak, J., Romero, O. and Lehner, W. 2019. DECO: A Dataset of Annotated Spreadsheets for Layout and Table Recognition. International Conference on Document Analysis and Recognition (ICDAR).

  • Mitlohner, J., Neumaier, S., Umbrich, J. and Polleres, A. 2016. Characteristics of Open Data CSV Files. International Conference on Open and Big Data (OBD).

  • Zanibbi, R., Blostein, D. and Cordy, J.R. 2004. A survey of table recognition: Models, observations, transformations, and inferences. Document Analysis and Recognition.

Table Recognition:

  • Christodoulakis, C., Munson, E.B. and Gabel, M. 2020. Pytheas: Pattern-based Table Discovery in CSV Files. PVLDB.

  • Dong, H., Liu, S., Han, S., Fu, Z. and Zhang, D. 2019. TableSense: Spreadsheet Table Detection with Convolutional Neural Networks. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).

  • Gilani, A., Qasim, S.R., Malik, I. and Shafait, F. 2017. Table Detection Using Deep Learning. IAPR International Conference on Document Analysis and Recognition (ICDAR).

  • Koci, E., Thiele, M., Rehak, J., Romero, O. and Lehner, W. 2019. DECO: A Dataset of Annotated Spreadsheets for Layout and Table Recognition. International Conference on Document Analysis and Recognition (ICDAR).

  • Koci, E., Thiele, M., Romero, O. and Lehner, W. 2019. A Genetic-Based Search for Adaptive Table Recognition in Spreadsheets. International Conference on Document Analysis and Recognition (ICDAR).

Datasets

Here we list datasets and annotations that are useful for table recognition. Note that, due to licensing restrictions, only publicly distributable datasets are listed. Each link points to a compressed JSON file that includes both the verbose CSV files and their annotations.

Dataset  # Files  # Regions  # Templates  Description
DECO     854      3,785      750          The DECO dataset, an excerpt of the larger Enron spreadsheet corpus, annotated for the region extraction/table recognition and template extraction tasks.
FUSTE    886      1,857      136          The FUSTE dataset, an excerpt of the larger FUSE spreadsheet corpus, annotated for the region extraction/table recognition and template extraction tasks.

Organization

  • Project seminar for master's students
  • Language of instruction: English
  • Maximum number of participants: 6

Students form teams of one or two members. Each team is assigned a task and the corresponding publications. After studying this (and further) literature, each team should present a summary of the state-of-the-art solution and, in parallel, develop a practical implementation. To present the baseline and the results of the first phase to the whole group, every team will give a mid-term presentation.

In the second half of the seminar, each team tries to improve on the baseline or to find a better solution for their task. The team members then report on their improvements in a final presentation. To conclude the seminar, each team prepares a paper-style submission describing their solution.

Grading

The seminar counts for 6 LP (credit points). The final grade is composed of the following:

  • (15%) Active participation in meetings and discussions
  • (15%) Technical presentation of a scientific paper (the chosen baseline)
  • (20%) End-term presentation
  • (20%) Quality of implementation and coding style
  • (30%) Final paper-style submission