In general, polluted CSV files can contain multiple "regions" that serve different purposes: tables, footnotes, or additional metadata. We define such files as multiregion. In files that originate from the same source, the same region layout may recur with different data content. For automated data preparation, extraction, or integration, there is great value in recognizing the presence and layout of regions within a file, and in discovering multiregion templates, i.e., file layouts that occur in multiple files.
We developed Mondrian, an automated approach to detect multiple regions in a spreadsheet, describe their layout using a graph representation, and compare these layouts with a similarity flooding-based algorithm.
First, the cells of a spreadsheet are converted into pixels, encoded with different colors, such that cells with similar syntax share similar colors (e.g., integer numbers are dark blue and floating point numbers are light blue, while strings are red).
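The cell-to-pixel encoding can be illustrated with a minimal sketch. The type rules, the RGB values in `PALETTE`, and the function names `cell_type` and `to_image` are illustrative assumptions, not Mondrian's actual implementation; only the idea that cells with similar syntax receive similar colors comes from the description above.

```python
import re

# Hypothetical palette: the exact RGB values are assumptions for illustration.
PALETTE = {
    "empty": (255, 255, 255),    # white
    "integer": (0, 0, 139),      # dark blue
    "float": (173, 216, 230),    # light blue
    "string": (255, 0, 0),       # red
}


def cell_type(value):
    """Classify a raw cell string by its syntax (simplified rules)."""
    text = value.strip()
    if text == "":
        return "empty"
    if re.fullmatch(r"[+-]?\d+", text):
        return "integer"
    if re.fullmatch(r"[+-]?(\d+\.\d*|\.\d+)", text):
        return "float"
    return "string"


def to_image(rows):
    """Map a ragged list of cell rows to a rectangular grid of RGB pixels."""
    width = max((len(row) for row in rows), default=0)
    return [
        [PALETTE[cell_type(row[c] if c < len(row) else "")] for c in range(width)]
        for row in rows
    ]
```

Short rows are padded with empty cells so that the result is a rectangular image, which is what a pixel-based partitioning step would expect.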
Then, Mondrian detects multiple regions, partitioning the image into groups of adjacent pixels and clustering them together to form regions.
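As a rough illustration of grouping adjacent pixels, the sketch below finds 4-connected components of non-empty cells and returns their bounding boxes. This is a simplified stand-in: the subsequent clustering of groups into regions is omitted, and the function name `find_regions` and the boolean-grid input are assumptions made for this example.

```python
from collections import deque


def find_regions(grid):
    """Return bounding boxes (top, left, bottom, right) of 4-connected
    groups of non-empty cells. `grid` is a list of rows of booleans,
    where True marks a non-empty cell."""
    n_rows = len(grid)
    seen = set()
    boxes = []
    for r in range(n_rows):
        for c in range(len(grid[r])):
            if not grid[r][c] or (r, c) in seen:
                continue
            # Breadth-first flood fill starting from an unvisited non-empty cell.
            queue = deque([(r, c)])
            seen.add((r, c))
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, left = min(top, y), min(left, x)
                bottom, right = max(bottom, y), max(right, x)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (
                        0 <= ny < n_rows
                        and 0 <= nx < len(grid[ny])
                        and grid[ny][nx]
                        and (ny, nx) not in seen
                    ):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes
```

Boxes are reported in the order in which their first cell is met during a row-by-row scan, so a table at the top of the sheet precedes a footnote below it.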
For each of the extracted regions, Mondrian calculates a "fingerprint" that reflects the region's syntactic and structural properties. This fingerprint is then used to compare regions across different files. Files that show at least one similar region are more likely to share the same layout. Therefore, they are selected as candidate instances of the same template.
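To make the idea of a fingerprint concrete, the following sketch describes a region by the proportion of each cell type plus a normalised shape ratio, and compares two such vectors with cosine similarity. The chosen features, the type labels in `TYPES`, and the use of cosine similarity are assumptions for illustration only; the text above does not specify how Mondrian computes or compares fingerprints.

```python
import math
from collections import Counter

# Hypothetical cell-type labels, assumed for this example.
TYPES = ("empty", "integer", "float", "string")


def fingerprint(region):
    """Summarise a region (2D list of type labels) as a feature vector:
    one proportion per cell type, followed by a shape ratio in [0, 1]."""
    n_rows = len(region)
    n_cols = max((len(row) for row in region), default=0)
    counts = Counter(label for row in region for label in row)
    total = sum(counts.values()) or 1
    proportions = [counts[t] / total for t in TYPES]
    shape = n_rows / (n_rows + n_cols) if n_rows + n_cols else 0.0
    return proportions + [shape]


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Because the features are proportions rather than counts, two regions with the same composition but different sizes obtain the same fingerprint, which matches the goal of finding layouts that repeat with different data content.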
Each file layout is then described with a connected graph that encodes the information about the regions that it contains and their connectivity (i.e., how they are arranged relative to each other in the spreadsheet). Finally, using a graph similarity score, we measure the similarity between file layouts that had similar regions.
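A layout graph of this kind can be sketched as a set of labelled edges between regions. In the example below, every pair of regions is connected by edges labelled with their relative direction, and two layouts are compared with a Jaccard score over those edges. The Jaccard score is a deliberately simple stand-in and is not the similarity flooding-based measure that Mondrian uses; the region kinds (`"table"`, `"footnote"`) and all function names are hypothetical.

```python
def direction(a, b):
    """Position of box `b` relative to box `a`.
    Boxes are (top, left, bottom, right) in cell coordinates."""
    if b[0] > a[2]:
        return "below"
    if b[2] < a[0]:
        return "above"
    if b[1] > a[3]:
        return "right"
    if b[3] < a[1]:
        return "left"
    return "overlap"


def layout_graph(regions):
    """Build labelled edges (kind_a, direction, kind_b) for every ordered
    pair of distinct regions. `regions` maps a name to (kind, box)."""
    edges = set()
    for name_a, (kind_a, box_a) in regions.items():
        for name_b, (kind_b, box_b) in regions.items():
            if name_a != name_b:
                edges.add((kind_a, direction(box_a, box_b), kind_b))
    return edges


def jaccard(edges_a, edges_b):
    """Share of labelled edges that two layout graphs have in common."""
    union = edges_a | edges_b
    return len(edges_a & edges_b) / len(union) if union else 1.0
```

Since the edges depend only on region kinds and relative positions, two files with a table above a footnote yield the same graph even when the table sizes differ.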
To improve its usability, Mondrian includes a graphical interface to assist end users. Users can load a spreadsheet file, run the automated region detection and template recognition algorithms, visualize the results, and interactively adjust the detected regions and templates.