Spreadsheets and similar structured files are valuable sources of data, but they are often ill-suited for machine consumption. Although a spreadsheet arranges its cells in a grid, the data in those cells often follows a free layout with no clearly defined tabular structure. Worse, a sheet may contain several independent tables in separate regions, which end users interested in the content must ultimately recognize and merge themselves. For automated data preparation, extraction, or integration, there is therefore great value in recognizing the presence and layout of regions, especially tables, within a spreadsheet.
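To make the notion of independent regions concrete, here is a minimal sketch of a naive baseline: it groups non-empty cells that touch each other (including diagonally) and reports one bounding box per group. This is an illustration only, not one of the approaches from the seminar papers. The function name `find_regions`, the list-of-lists sheet representation, and the sample data are all assumptions made for this example.

```python
from collections import deque


def find_regions(grid):
    """Return bounding boxes (top, left, bottom, right) of connected
    groups of non-empty cells. Hypothetical helper for illustration."""
    rows = len(grid)
    cols = max((len(row) for row in grid), default=0)

    def filled(r, c):
        # Rows may be ragged; missing cells count as empty.
        return c < len(grid[r]) and grid[r][c] not in (None, "")

    seen = set()
    regions = []
    for r in range(rows):
        for c in range(cols):
            if (r, c) in seen or not filled(r, c):
                continue
            # Breadth-first search over the 8 neighbouring cells.
            queue = deque([(r, c)])
            seen.add((r, c))
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, left = min(top, y), min(left, x)
                bottom, right = max(bottom, y), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and (ny, nx) not in seen
                                and filled(ny, nx)):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
            regions.append((top, left, bottom, right))
    return regions


# Made-up sheet: two small tables side by side and a stray "Total" row.
sheet = [
    ["Name", "Age", None, None, None],
    ["Ann", 31, None, "City", "Pop"],
    [None, None, None, "Oslo", 700000],
    ["Total", 1, None, None, None],
]

for box in find_regions(sheet):
    print(box)
```

On this sample the sketch reports three separate regions, including the "Total" row that a human reader would probably attach to the first table. An empty row inside a table would split it in the same way, which is one reason simple adjacency is not enough and more careful recognition methods are needed.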
Table recognition is a well-known problem that researchers have tackled in various domains and under different assumptions. In this seminar, we will introduce you to the research area of table recognition in spreadsheet files. Each team, ideally consisting of two students, will explore, implement, and potentially improve on a different solution for detecting and extracting tables from spreadsheet files.
We will provide you with state-of-the-art papers that propose solutions to this problem. You will implement these solutions in a scalable way and try to improve on them with your own ideas. We will also provide thousands of files for testing and evaluation.