
Cleaning Messy Records in Data Files

Mazhar Hameed

Information Systems Group
Hasso Plattner Institute

Office: Prof.-Dr.-Helmert-Straße 2-3 D-14482 Potsdam Room: F-2.05
Tel.:  +49 331 5509 274
Email: mazhar.hameed(at)hpi.de
Supervisor: Prof. Dr. Felix Naumann

 

For any data-driven application, data loading is typically the first step; even data cleaning and preparation require that the data first be loaded successfully. However, most data collections are prone to errors that can hinder loading. These issues generally stem from pre-existing errors in the data files, such as (1) empty or null values, (2) out-of-range values, (3) inconsistent column data, and (4) misplaced delimiters, to name a few. To address data loading issues in data stream applications, we propose a methodology that uses data patterns to identify irregularities by finding (1) common or frequent data patterns and (2) irregular or unique data patterns. For a record or column to be identified as “inconsistent” or “out of form”, its pattern must differ from the patterns of the remaining records, or from the other patterns in its own group.
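The idea of separating frequent from irregular patterns can be illustrated with a minimal Python sketch. It is not the method developed in this research; the abstraction (digit runs become D, letter runs become L), the function names record_pattern and split_by_pattern_frequency, and the min_share threshold are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Assumption: a very coarse abstraction in which every run of digits
# becomes "D" and every run of ASCII letters becomes "L"; all other
# characters (delimiters, punctuation, spaces) are kept unchanged.
_TOKEN = re.compile(r"[0-9]+|[A-Za-z]+")


def record_pattern(line):
    """Map a raw line to its coarse pattern, e.g. '12,Alice,3.5' -> 'D,L,D.D'."""
    return _TOKEN.sub(lambda m: "D" if m.group()[0].isdigit() else "L", line)


def split_by_pattern_frequency(lines, min_share=0.5):
    """Split lines into (regular, irregular) by how common their pattern is.

    A line counts as regular if its pattern covers at least `min_share`
    of all lines; the threshold is a hypothetical tuning parameter.
    """
    patterns = [record_pattern(line) for line in lines]
    counts = Counter(patterns)
    total = len(lines)
    regular, irregular = [], []
    for line, pattern in zip(lines, patterns):
        if counts[pattern] / total >= min_share:
            regular.append(line)
        else:
            irregular.append(line)
    return regular, irregular


sample = [
    "12,Alice,3.5",
    "13,Bob,4.0",
    "15,Dave,2.25",
    "16,Eve,1.0",
    "14;Carol;n/a",  # different delimiter and a non-numeric last value
]
regular, irregular = split_by_pattern_frequency(sample)
```

In this sample, four of the five lines share one pattern, so only the line with the deviating delimiter and value lands in the irregular group. The sketch works on raw lines and needs no knowledge of the delimiter or the column types.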

Overview:

Records that do not conform to the rest of the data are one of the contributing factors in data loading issues. In our research, we refer to such records as “messy records”. For a record to be regarded as “messy”, not all of its columns and cell values need to be irregular: a single value that is out of order or out of context can be enough for the record to stand out as an outlier from the rest of the data.
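The point that one deviating cell is enough to make a record messy can be sketched as follows. This is an illustrative example only, under two assumptions that go beyond the text: the records have already been split into cells, and all records have the same number of cells. The names cell_pattern and messy_rows are hypothetical.

```python
import re
from collections import Counter

# Assumption: same coarse abstraction as a simple illustration,
# digit runs -> "D", letter runs -> "L", everything else unchanged.
_TOKEN = re.compile(r"[0-9]+|[A-Za-z]+")


def cell_pattern(value):
    """Map a single cell value to its coarse pattern, e.g. '3.5' -> 'D.D'."""
    return _TOKEN.sub(lambda m: "D" if m.group()[0].isdigit() else "L", value)


def messy_rows(rows):
    """Return {row index: [column indices]} for cells that deviate from
    the most common pattern of their column.

    Assumes every row has the same number of cells.
    """
    if not rows:
        return {}
    n_cols = len(rows[0])
    # Most common pattern per column.
    dominant = []
    for col in range(n_cols):
        counts = Counter(cell_pattern(row[col]) for row in rows)
        dominant.append(counts.most_common(1)[0][0])
    # A row is flagged as soon as one of its cells deviates.
    flagged = {}
    for index, row in enumerate(rows):
        bad = [col for col, value in enumerate(row)
               if cell_pattern(value) != dominant[col]]
        if bad:
            flagged[index] = bad
    return flagged


rows = [
    ["12", "Alice", "3.5"],
    ["13", "Bob", "4.0"],
    ["14", "Carol", "n/a"],  # only the last cell is out of form
    ["15", "Dave", "2.25"],
]
flagged = messy_rows(rows)
```

Here the third record is flagged because of its last cell alone, while its first two cells match the patterns of their columns.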

Current Research:

Our research aims to identify and handle such messy records with a technique that does not rely on external information, such as field data types, record structure, or file dialect.

Previous Work

Data Preparation: A Survey of Commercial Tools

Abstract:

Raw data are often messy: they follow different encodings, records are not well structured, values do not adhere to patterns, etc. Such data are in general not fit to be ingested by downstream applications, such as data analytics tools, or even by data management systems. The act of obtaining information from raw data relies on some data preparation process. Data preparation is integral to advanced data analysis and data management, not only for data science but for any data-driven applications. Existing data preparation tools are operational and useful, but there is still room for improvement and optimization. With increasing data volume and its messy nature, the demand for prepared data increases day by day. To cater to this demand, companies and researchers are developing techniques and tools for data preparation.
To better understand the available data preparation systems, we have conducted a survey to investigate (1) prominent data preparation tools, (2) distinctive tool features, (3) the need for preliminary data processing even for these tools, and (4) features and abilities that are still lacking. We conclude with an argument in support of automatic and intelligent data preparation beyond traditional and simplistic techniques.

Our paper makes the following contributions:


1. Organisation: We propose six broad categories of data preparation and identify 40 common data preparation steps, which we classify into those categories.
2. Documentation: We validate the availability of these features and broader categories for seven selected tools and document them in a feature matrix.
3. Evaluation: We evaluate the selected features of surveyed tools to identify whether the tool offers the stated functionalities or not.
4. Recommendation: We identify shortcomings of commercial data preparation tools in general and encourage researchers to explore further in the field of data preparation.

Discovered Tools with Asserted Data Preparation Capabilities

We collected notable commercial data preparation tools from business reports and analyses, company portals, and online demonstration videos. Our preliminary investigation yielded 42 commercial tools, which we then examined for the extent of their data preparation capabilities.

Selected Data Preparation Tools

Criteria for Selected Tools:

1. Domain specificity: tools that specifically address the data preparation task.
2. Comprehensiveness: the extent and sophistication to which tools adequately covered preparation features.
3. Guides and documentation: the availability of proper documentation for the tools, i.e., useful, up-to-date documentation with listings of features and how-to guides.
4. Trial availability: the availability of a trial version, giving us the opportunity to test the tools and validate their features.
5. GUI: the availability of a comprehensive and intuitive graphical user interface to select and apply preparations.
6. Customer assistance: compliant support teams that assisted users with generic and specific tool queries, when needed.

Preparator Matrix

The table below provides a feature matrix showing which preparators are supported by which tool in each of the six categories.

Publications

  • Hameed, Mazhar, and Felix Naumann. “Data Preparation: A Survey of Commercial Tools”. In SIGMOD Record, 2020.