Preparing Raw Data Files For Data Ingestion

Mazhar Hameed

Information Systems Group
Hasso Plattner Institute

Office: Prof.-Dr.-Helmert-Straße 2-3 D-14482 Potsdam Room: F-2.05
Tel.:  +49 331 5509 274
Email: mazhar.hameed(at)hpi.de
Supervisor: Prof. Dr. Felix Naumann

 

My research aims to provide a data preparation platform that helps end users prepare data efficiently and accurately by solving structural problems in data files. I started my research with a survey of commercial data preparation tools to develop an understanding of the data preparation tools and features available on the market.

Currently, I am working on SURAGH, a project to identify problematic (ill-formed) rows that make data loading cumbersome.

Current Research

SURAGH - Syntactic Pattern Matching to Identify Ill-Formed Records

Comma-separated values (CSV) files are one of the most commonly used data formats and appear frequently in scientific projects. Not only can CSV files be used for exploring, collecting, and integrating data, but they are also widely used in other research areas, such as knowledge-based design, building machine learning models, and information mining. Despite the existing standard RFC 4180, data entry into CSV files is prone to errors because users and applications do not always adhere to this standard; moreover, the standard itself is rather loose. Our goal is to identify such problems in advance and alert users or machines about the problematic records. We refer to such records as “ill-formed”.

With our research project, SURAGH, we focus on finding these ill-formed records in files by identifying (1) common or frequent data patterns and (2) irregular or unique data patterns.

Ill-Formed Records Examples
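
As an illustration, consider the following constructed example (hypothetical data, not taken from our experimental datasets). Only the unannotated row follows the dominant four-column structure; the other rows exhibit inconsistencies from the lists further below:

    # population register, last updated 2021     <- comment row
    id,name,district,population                  <- header row
    1,Alice,North,3200
    2,Bob,South                                  <- inconsistent number of columns
    ,,,                                          <- empty row
    Total,,,3200                                 <- aggregation row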

The SURAGH Algorithm

Pattern Generation

Syntactic patterns are sequences of symbols that represent input values. For syntactic pattern generation, SURAGH parses values column by column and transforms them using abstractions based on its pattern grammar. For a record to be identified as “inconsistent” or “ill-formed” with respect to the rest of the data, its record pattern should differ from the remaining record patterns or form its own group of patterns.
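
The following sketch illustrates how a single column value can be mapped to candidate patterns at different abstraction levels. It is a simplification: only the abstractions <D> and <SEQD> come from the description above, the letter abstraction <SEQC> is a hypothetical placeholder, and the actual SURAGH grammar is richer.

    import re

    # Minimal sketch of column-wise pattern generation with a toy abstraction set:
    # <D> (single digit), <SEQD> (digit sequence), and a hypothetical <SEQC>
    # for letter sequences.
    def candidate_patterns(value: str) -> set:
        tokens = re.findall(r"\d+|[A-Za-z]+|.", value)  # digit runs, letter runs, other chars
        fine, coarse = [], []
        for tok in tokens:
            if tok.isdigit():
                fine.append("<D>" * len(tok))           # abstract every single digit
                coarse.append("<SEQD>")                 # abstract the whole digit run
            elif tok.isalpha():
                fine.append("<SEQC>")
                coarse.append("<SEQC>")
            else:
                fine.append(tok)                        # keep delimiters and symbols literal
                coarse.append(tok)
        return {"".join(fine), "".join(coarse)}

    # candidate_patterns("2020-09") -> {"<D><D><D><D>-<D><D>", "<SEQD>-<SEQD>"}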

Pattern Grammar

To construct syntactic patterns, we use rules for transforming values and refer to them as abstractions, e.g., to transform the input value "9", we use the abstractions "<D>" and "<SEQD>". To represent syntactic patterns, we specify the grammar using extended Backus-Naur form (EBNF) rules and notation.
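
A simplified, illustrative excerpt of such a grammar in EBNF notation might look as follows; the nonterminal names and the restriction to the two abstractions above are our own simplification, and the full grammar is larger:

    (* illustrative excerpt only *)
    row pattern    = column pattern , { "," , column pattern } ;
    column pattern = ( abstraction | literal ) , { abstraction | literal } ;
    abstraction    = seq digit | digit ;                (* "<SEQD>" and "<D>" *)
    seq digit      = digit , { digit } ;
    digit          = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
    literal        = ? any character that is not the column delimiter ? ;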

Pattern Pruning

During syntactic pattern generation, the algorithm aims to apply every combination of abstractions to generate all possible syntactic patterns for representing column values. However, not every generated pattern is suitable for ill-formed record detection, due to its generality and/or coverage. To select a highly specific, high-coverage subset of column patterns from the generated patterns, we use specificity- and coverage-based pruning techniques and optimize the different phases of the algorithm.
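
As a rough illustration of coverage-based pruning for a single column, the sketch below builds on the candidate_patterns helper above; the 0.5 threshold and the length-based notion of specificity are assumptions for illustration, not the paper's exact criteria:

    from collections import Counter

    def prune_column_patterns(values, min_coverage=0.5):
        # count, for every candidate pattern, how many column values it represents
        coverage = Counter()
        for v in values:
            for p in candidate_patterns(v):
                coverage[p] += 1
        n = len(values)
        # keep only patterns with sufficient coverage; among those, rank the more
        # specific ones first (specificity crudely approximated by pattern length)
        kept = {p: c / n for p, c in coverage.items() if c / n >= min_coverage}
        return dict(sorted(kept.items(), key=lambda kv: (-kv[1], -len(kv[0]))))

    # prune_column_patterns(["2020-09", "2021-10", "n/a"]) keeps the date-like
    # patterns (coverage 2/3) and drops the pattern produced only by "n/a".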

Row Patterns

SURAGH creates row patterns by using all combinations of column patterns across all columns while pruning based on coverage. We use these row patterns to create a pattern schema and classify each record as ill- or well-formed.
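
A minimal sketch of this step, again building on the helpers above (real row-pattern pruning and schema selection in SURAGH are more involved); kept_column_patterns is assumed to be a list with one collection of surviving patterns per column:

    from itertools import product

    def classify_rows(rows, kept_column_patterns, delimiter=","):
        # candidate schema = all combinations of the surviving column patterns
        schema = {delimiter.join(combo) for combo in product(*kept_column_patterns)}
        labels = []
        for row in rows:
            fields = row.split(delimiter)
            row_patterns = {delimiter.join(combo)
                            for combo in product(*[candidate_patterns(f) for f in fields])}
            # a row matching no schema pattern (e.g., wrong column count) is ill-formed
            labels.append("well-formed" if row_patterns & schema else "ill-formed")
        return labels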

Syntactic Pattern Applications

The syntactic patterns we introduce can assist the user in various tasks:

  • Make data parsable by detecting ill-formed records.
  • Assist users in distinguishing files with multiple tables.
  • Assist users in predetermining how messy the input files are.
  • Provide a basis for data standardization.

Data Annotation and Experimentation

For our experiments, we use datasets collected from five different open data sources: DataGov, Mendeley, GitHub, UKGov, and NYCData.
For ground truth creation, we annotated each record in each file as ill- or well-formed, based on the common inconsistencies that cause well-formed rows to become ill-formed.

Below, we list these inconsistencies; a small detection sketch for two of the row-level cases follows the lists.

Column level inconsistencies:

  • Column type formatting.
  • Null values.
  • Out-of-range values.
  • Line breaks within a value.
  • Values non-compliant with the schema.
  • Values without an escape character.
  • Values with an incorrect escape character.

Row level inconsistencies:

  • Comment row.
  • Header row.
  • Group header row.
  • Aggregation row.
  • Empty row.
  • Multi-delimited row.
  • Misplaced delimiters.
  • Inconsistent number of columns.
  • Footnote row.
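
The sketch below is our illustration (not the actual annotation tooling used for the ground truth) and shows how two of the row-level cases, empty rows and rows with an inconsistent number of columns, can be flagged automatically:

    import csv
    from collections import Counter

    # Flag empty rows and rows whose column count deviates from the dominant width.
    def flag_simple_inconsistencies(path, delimiter=","):
        with open(path, newline="", encoding="utf-8") as f:
            rows = list(csv.reader(f, delimiter=delimiter))
        widths = Counter(len(r) for r in rows if r)
        dominant_width = widths.most_common(1)[0][0] if widths else 0
        flags = []
        for i, r in enumerate(rows):
            if not r or all(field.strip() == "" for field in r):
                flags.append((i, "empty row"))
            elif len(r) != dominant_width:
                flags.append((i, "inconsistent number of columns"))
        return flags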

Previous Work

Data Preparation: A Survey of Commercial Tools

Abstract:

Raw data are often messy: they follow different encodings, records are not well structured, values do not adhere to patterns, etc. Such data are in general not fit to be ingested by downstream applications, such as data analytics tools, or even by data management systems. The act of obtaining information from raw data relies on some data preparation process. Data preparation is integral to advanced data analysis and data management, not only for data science but for any data-driven application. Existing data preparation tools are operational and useful, but there is still room for improvement and optimization. With increasing data volume and its messy nature, the demand for prepared data increases day by day. To cater to this demand, companies and researchers are developing techniques and tools for data preparation.
To better understand the available data preparation systems, we have conducted a survey to investigate (1) prominent data preparation tools, (2) distinctive tool features, (3) the need for preliminary data processing even for these tools, and (4) features and abilities that are still lacking. We conclude with an argument in support of automatic and intelligent data preparation beyond traditional and simplistic techniques.

The Need For Data Preparation

Raw data appears in many situations: logs, sensor output, government data, medical research data, climate data, geospatial data, etc. It accumulates in many places, such as file systems, data lakes or online repositories. 
In typical scenarios, raw data from various sources is accrued without any standardized formats or structure and with no specific target use-case; thus, it can appear messy, contain invalid characters, use different encodings, lack necessary columns, contain unwanted rows, have missing values, not follow valid patterns, etc. 

One example scenario in need of data preparation technology is data lakes, which store heterogeneous raw data. They can turn into vast repositories or collections of unstructured, semi-structured, unformatted, messy, and unclean data. The large volumes of data in data lakes are compelling and can generate valuable information, provided they are thoroughly pre-processed, cleaned, and prepared. With the ever-increasing amount of raw data, the need for data preparation has become more apparent.

Preparing data yields many advantages, such as prompt error detection, improved analytics, improved data quality, enhanced scalability, accelerated data usage, and easier data collaboration.

Data Preparation vs. Data Cleaning

We define data preparation as the set of preprocessing operations performed in the early stages of a data processing pipeline, i.e., data transformations at the structural and syntactic levels. In contrast, data cleaning concerns subsequent data transformations and corrections at the semantic level.

Data Preparation Tasks

Data preparation is not a single-step process. Rather, it usually comprises many individual preparation steps, implemented by what we call preparators, which we have organized anew into six broader categories, defined here.

Data discovery is the process of analyzing and collecting data from different sources, for instance to match data patterns, find missing data, and locate outliers.

Data validation comprises rules and constraints to inspect the data, for instance for correctness, completeness, and other data quality constraints.

Data structuring encompasses tasks for the creation, representation, and structuring of information. Examples include updating the schema, detecting and changing encodings, and transforming data by example.

Data enrichment adds value or supplementary information to existing data from separate sources. Typically, it involves augmenting existing data with new or derived data values using data lookups, primary key generation, and inserting metadata.

Data filtering generates a subset of the data under consideration, facilitating manual inspection and removing irregular data rows or values. Examples include extracting text parts, and keeping or deleting filtered rows.

Data cleaning refers to the removal, addition, or replacement of less accurate or inaccurate data values with more suitable, accurate, or representative values. Typical examples are deduplication, filling in missing values, and removing whitespace.

Despite our definition, which distinguishes data preparation and cleaning, we include data cleaning steps here as well, as most data preparation tools also venture into this area.
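
To make the categories more concrete, here is a toy sketch using pandas with hypothetical data and column names; the surveyed tools expose such preparators through their graphical interfaces rather than as code:

    import pandas as pd

    # Toy illustration of preparators from three of the categories above:
    # structuring, cleaning, and filtering.
    df = pd.DataFrame({
        "name": ["  Alice ", "Bob", None],
        "age":  ["34", "n/a", "29"],
    })

    # data structuring: change the column type; non-numeric values become NA
    df["age"] = pd.to_numeric(df["age"], errors="coerce")

    # data cleaning: remove surrounding whitespace and fill missing values
    df["name"] = df["name"].str.strip().fillna("unknown")

    # data filtering: keep only rows that satisfy a constraint
    adults = df[df["age"].notna() & (df["age"] >= 18)]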

This Project Makes the Following Contributions:


1. Organisation: We propose six broad categories of data preparation and identify 40 common data preparation steps, which we classify into those categories.
2. Documentation: We validate the availability of these features and broader categories for seven selected tools and document them in a feature matrix.
3. Evaluation: We evaluate the selected features of the surveyed tools to verify whether each tool actually offers the stated functionality.
4. Recommendation: We identify shortcomings of commercial data preparation tools in general and encourage researchers to explore further in the field of data preparation.

Discovered Tools with Asserted Data Preparation Capabilities

We collected notable commercial data preparation tools from business reports and analyses, company portals, and online demonstration videos. Our preliminary investigation resulted in 42 initial commercial tools, which we then examined for the extent of their data preparation capabilities.

Selected Data Preparation Tools

Criteria for Selected Tools:

1. Domain specificity: tools that specifically address the data preparation task.
2. Comprehensiveness: the extent and sophistication to which tools adequately covered preparation features.
3. Guides and documentation: the availability of proper documentation for the tools, i.e., useful, up-to-date documentation with listings of features and how-to guides.
4. Trial availability: the availability of a trial version, giving us the opportunity to test the tools and validate their features.
5. GUI: the availability of a comprehensive and intuitive graphical user interface to select and apply preparations.
6. Customer assistance: responsive support teams that assisted users with generic and specific tool queries, when needed.

Preparator Matrix

The table below provides a feature matrix showing which preparator is supported by which tool in each of the six categories.

Further Features of Data Preparation Tools

In our survey of commercial tools, we came across many functional and non-functional system features that fall outside our data preparation focus. Nonetheless, these features are important and interesting to explore and use; therefore, we have collected them as well.

Publications

  • Hameed, Mazhar, and Felix Naumann. “Data Preparation: A Survey of Commercial Tools”. SIGMOD Record 49, no. 3 (2020).