Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

SURAGH - Detecting ill-formed Records in CSV Files

Typically for any data driven application data loading is the initial process, even for data cleaning and preparation the data first must be successfully loaded. However, most data collections are prone to errors, which in turn can hinder data loading; we refer to records with such problems as "ill-formed records". Ill-formed records occur in raw data due to loosely defined schemata, incorrect formatting of values, record structure discrepancy, etc. To detect ill-formed records in files, we propose SURAGH that aims to utilize row patterns and identify irregularities by finding (1) common or frequent data patterns and, (2) irregular or unique data patterns. 

Open Data Portal CSV Files

Comma-separated value (CSV) files are one of the most commonly used formats, widely available on most open data portals, and frequently used in scientific projects. Not only can these CSV files be used for exploring, collecting, and integrating data, but their use is also significant in other research areas, such as knowledge-based design, building machine learning models, and information mining.

Despite the existing standard RFC 4180, data entry into CSV files is prone to errors because users and applications do not always ad-here to this standard; moreover, the standard itself is rather loose. This behavior is also present in CSV files available on open data portals. Out of 2066 files randomly selected from a government data portal Data.Gov, we were unable to directly load 418 (20.2%) of them into an RDBMS due to ill-formed records. 
Similar to Data.Gov, we crawled data from four other open data portals (Mendeley, GitHub, UKGov, NYCData) and found files with ill formed records. We used files from all five sources in our experiments.

Example Files

The Workflow of SURAGH

Example File with Dominant Syntactic Row Patterns (blue and green) and Ill-Formed Records (red).

Pattern Generation

Syntactic patterns are a sequence of symbols to represent input values. For syntactic pattern generation, SURAGH parses values column by column and transforms them using abstractions based on its pattern grammar. The reasoning for a record to be identified as “incocnsitent” or “ill-formed” from the rest of the data, the record pattern should be different than the rest of the record patterns or exist in its group of patterns.

Pattern Grammar

To construct syntactic patterns, we use rules for transforming values and refer to them as abstractions, e.g., to transform the input value "9", we use the abstractions "<D>" and "<SEQD>".  To represent syntactic patterns, we specify the grammar using the extended Backus-Naur Form (EBNF) rules and its notations.

Pattern Pruning

During syntactic pattern generation, the algorithm aims to apply every combination of abstractions to generate all possible syntactic patterns for representing column values. However, not every generated pattern is suitable for the ill-formed record detection due to its generality and/or coverage. To select the highly specific and high coverage subset of column patterns from the generated patterns, we make use of specificity- and coverage-based pruning techniques and optimize the different phases of the algorithm.

Row Patterns

SURAGH creates row patterns by using all combinations of column patterns across all columns while pruning based on coverage. We use these row patterns to create a pattern schema and classify each record as ill- or well-formed.

Data Annotation and Experimentation

For our experiments we are using datasets collected from five different open data sources: DataGov Mendeley, GitHub, UKGov, NYCData. To create the ground truth, we annotated each record in each file as either ill-formed or well-formed, based on the common inconsistencies that cause well-formed rows to become ill-formed rows. The following is a list of these inconsistencies:

Column level inconsistencies:

  •     Column type formatting.
  •     Null values.
  •     Out of range values.
  •     Line breaks within the value.
  •     Values non-compliant to schema.
  •     Values without escape character.
  •     Values with incorrect escape character.

Row level inconsistencies:

  •      Comment row.
  •      Header row.
  •      Group header row.
  •      Aggregation row.
  •      Empty row.
  •      Multi delimited row.
  •      Misplaced delimiters.
  •      Inconsistent number of columns.
  •      Footnote row.

 

Resources

Code & Datasets

SURAGH is an open-source project. Its code along with datasets and annotations are available on the project page at GitHub.

Contacts

The research project is conducted at the Data Preparation group. If you have any question, interesting idea, or request, please contact Mazhar Hameed.