Spreadsheet datasets are valuable sources of data, but they are often ill-suited for machine consumption. Their unstructured nature allows users to arrange data and metadata freely in a human-readable format, often in canvas-like layouts. To extract their content, data practitioners must resort to manual inspection and run cumbersome preparation pipelines. The Mondrian system is designed to assist users in identifying and handling multiregion layout templates: spreadsheet layouts composed of independent regions that appear repeatedly across different files. Mondrian comprises an automated approach to detect multiple regions within a single file and an algorithm that maps region layouts to graphs in order to compute layout similarity and identify templates. Users interact with Mondrian through a web-based visual interface that serves as a practical toolkit for handling collections of multiregion spreadsheets and enables their automated preparation.
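To make the two ideas above concrete (finding regions in a sheet, then comparing layouts through a graph), the following is a minimal illustrative sketch. It is not Mondrian's algorithm: it assumes a sheet is given as a list of rows, treats a region as the bounding box of a connected group of non-empty cells, and uses a crude Jaccard score over direction-labeled edges as a stand-in for layout similarity. The names `detect_regions`, `layout_edges`, and `layout_similarity` are hypothetical.

```python
from collections import deque


def detect_regions(grid):
    """Return bounding boxes (top, left, bottom, right) of connected non-empty cells.

    Assumption: cells are connected if they touch horizontally, vertically,
    or diagonally; None and "" count as empty.
    """
    rows = len(grid)
    cols = max((len(row) for row in grid), default=0)

    def filled(r, c):
        return c < len(grid[r]) and grid[r][c] not in (None, "")

    seen = set()
    regions = []
    for r in range(rows):
        for c in range(cols):
            if (r, c) in seen or not filled(r, c):
                continue
            # Breadth-first search over the connected group starting at (r, c).
            seen.add((r, c))
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, left = min(top, y), min(left, x)
                bottom, right = max(bottom, y), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and (ny, nx) not in seen and filled(ny, nx)):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
            regions.append((top, left, bottom, right))
    return regions


def layout_edges(regions):
    """Map a layout to a graph: nodes are regions, edges carry a relative direction."""
    ordered = sorted(regions)
    edges = set()
    for i, a in enumerate(ordered):
        for j in range(i + 1, len(ordered)):
            b = ordered[j]
            if b[1] > a[3]:
                direction = "right"   # b starts to the right of a
            elif b[0] > a[2]:
                direction = "below"   # b starts below a
            else:
                direction = "other"
            edges.add((i, j, direction))
    return edges


def layout_similarity(regions_a, regions_b):
    """Jaccard similarity of the two edge sets (a simplification, by assumption)."""
    ea, eb = layout_edges(regions_a), layout_edges(regions_b)
    if not ea and not eb:
        return 1.0
    return len(ea & eb) / len(ea | eb)
```

For example, two sheets that place a table in the top-left corner, a notes column to its right, and a totals row below yield the same edge set under this sketch, so they would be grouped as candidates for one template regardless of the cell values they contain.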
Authors
Gerardo Vitagliano (Hasso Plattner Institute)
Lucas Reisener (Hasso Plattner Institute)
Lan Jiang (Hasso Plattner Institute)
Mazhar Hameed (Hasso Plattner Institute)
Felix Naumann (Hasso Plattner Institute)