Hasso-Plattner-InstitutSDG am HPI
Hasso-Plattner-InstitutDSG am HPI

Preparing Raw Data Files For Data Ingestion

Mazhar Hameed

Information Systems Group
Hasso Plattner Institute

Office: Prof.-Dr.-Helmert-Straße 2-3 D-14482 Potsdam Room: F-2.05
Tel.:  +49 331 5509 274
Email: mazhar.hameed(at)hpi.de
Supervisor: Prof. Dr. Felix Naumann


We live in the age of technology where data is the new oil, and much like its predecessor, data must be extracted and refined before it has any practical use. However, unlike its predecessor, the amount of data generated is enormous and growing exponentially, spurred by surveillance devices generating sensor data, social media platforms, government data portals, medical research projects, etc. Unfortunately, data curated from these devices and platforms are often in a raw format, so parsing it without a standardized format introduces many structural inconsistencies, such as invalid characters due to incorrect parsing, column shifting due to incorrect escaping, and inconsistent formatting, etc., which causes data manipulation problems. Consequently, data scientists and machine learning engineers spend most of their time on the tedious tasks of data preparation.

My research aims to provide a data preparation platform to help end-users correctly parse and accurately prepare data by solving the structural problems mentioned above. To this end, I began my research by examining available data preparation tools and libraries to develop an understanding of the existing systems and their data preparation features. We began our survey with the discovery phase and collected more than 100 tools and libraries, which we narrowed down to 42 commercial systems that offer some data preparation capabilities. We made several contributions in our study, such as proposing broader categories of data preparation, identifying 40 common data preparation tasks, evaluating the state of the art of data preparation tools, and listing the prominent challenges that we came across that may lead to different research topics.

After looking at several tools, libraries and their features, I found that all the available tools offer great features, but what interested me the most was that these tools assume the pre-processed input files (no structural errors). Therefore, I focused my research on improving the pre-processing of raw data such as txt, csv, tsv, etc. To improve the structure of a file, we need to understand the structure of the records in that file. To do so, I have developed a system, SURAGH, that finds structural patterns in an input file and classifies clean (well-formed) and problematic (ill-formed) rows based on a pattern schema.

I am currently actively working on a project, TAHARAT, to clean up the structure of identified problematic rows (ill-formed) so that users can load their data efficiently.

SURAGH - Detecting ill-formed Records in CSV Files

Despite the existing standard, open data CSV files often contain records that do not adhere to the structural specifications of a CSV file. We call records with such structural inconsistencies as “ill-formed” records. A record may be ill-formed due to inconsistencies that can exhibit at both column- and row-levels. Loading a CSV file that contains ill-formed records is a tedious task. Users spend a lot of time manually checking the load-abort process, which stops loading data after many minutes due to an ill-formed record. To assist users in loading their data, we proposed SURAGH. SURAGH aims to utilize row patterns and identify irregularities by finding (1) common or frequent data patterns, and (2) irregular or unique data patterns to classify ill- and well-formed records..

Ill-Formed Records Examples

The Workflow of SURAGH

Example File with Dominant Syntactic Row Patterns (blue and green) and Ill-Formed Records (red).

Pattern Generation

Syntactic patterns are a sequence of symbols to represent input values. For syntactic pattern generation, SURAGH parses values column by column and transforms them using abstractions based on its pattern grammar. The reasoning for a record to be identified as “incocnsitent” or “ill-formed” from the rest of the data, the record pattern should be different than the rest of the record patterns or exist in its group of patterns.

Pattern Grammar

To construct syntactic patterns, we use rules for transforming values and refer to them as abstractions, e.g., to transform the input value "9", we use the abstractions "<D>" and "<SEQD>".  To represent syntactic patterns, we specify the grammar using the extended Backus-Naur Form (EBNF) rules and its notations.

Pattern Pruning

During syntactic pattern generation, the algorithm aims to apply every combination of abstractions to generate all possible syntactic patterns for representing column values. However, not every generated pattern is suitable for the ill-formed record detection due to its generality and/or coverage. To select the highly specific and high coverage subset of column patterns from the generated patterns, we make use of specificity- and coverage-based pruning techniques and optimize the different phases of the algorithm.

Row Patterns

SURAGH creates row patterns by using all combinations of column patterns across all columns while pruning based on coverage. We use these row patterns to create a pattern schema and classify each record as ill- or well-formed.

Syntactic Pattern Applications

The syntactic patterns we introduce can assist the user in various tasks:

  • Make data parsable by detecting ill formed records.
  • Assist users in distinguishing files with multiple tables.
  • Assist users in predetermining how messy the input files are.
  • Basis for standardization of data.

Data Annotation and Experimentation

For our experiments we are using datasets collected from five different open data sources: DataGov Mendeley, GitHub, UKGov, NYCData. To create the ground truth, we annotated each record in each file as either ill-formed or well-formed, based on the common inconsistencies that cause well-formed rows to become ill-formed rows. The following is a list of these inconsistencies:

Column level inconsistencies:

  •     Column type formatting.
  •     Null values.
  •     Out of range values.
  •     Line breaks within the value.
  •     Values non-compliant to schema.
  •     Values without escape character.
  •     Values with incorrect escape character.

Row level inconsistencies:

  •      Comment row.
  •      Header row.
  •      Group header row.
  •      Aggregation row.
  •      Empty row.
  •      Multi delimited row.
  •      Misplaced delimiters.
  •      Inconsistent number of columns.
  •      Footnote row.

TAHARAT- Cleaning ill-formed Records in CSV Files

We propose TAHARAT, a system built on our state-of-the-art error detection system, SURAGH. TAHARAT aims to improve the processing of raw CSV files by standardizing the structure of ill-formed records into a uniform format based on the structure of well-formed records. It leverages syntax-based patterns both to understand the ill- and well-formedness of individual records in a file and to clean up their structure for uniformity.

Wanted and Unwanted Records

While ill-formed records often are table titles, footnotes, or empty rows (unwanted), many contain payload data (wanted), along with additional structural or formatting information and possibly additional attributes, which made them ill-formed in the first place. They may also be as structurally valid as well-formed records, but were classified as ill-formed due to their non-dominant structure. To recognize whether ill-formed records contain payload data or not, we classify these records as wanted or unwanted. Then, we clean the structure of wanted records and delete the unwanted records.

The Workflow of TAHARAT

The workflow of TAHARAT consists of three phases. In the first phase, it first uses the output of its predecessor SURAGH to classify input file records as ill-formed or well-formed using dominant row patterns . Then, it runs SURAGH incrementally for ill-formed records to obtain row patterns specifically for those rows; we call these patterns potential dominant row patterns, as these ill-formed data rows can possibly be transformed into well-formed data rows. TAHARAT repeats the incremental pattern generation process until no dominant, ill-formed records are left.  After the first phase, TAHARAT obtains dominant and potential dominant patterns for well-formed and ill-formed records, respectively. The second phase uses these patterns to classify ill-formed records into wanted and unwanted. In the final phase, TAHARAT collects wanted records, well-formed records, and their patterns from the previous phase and removes the unwanted records. It then uses the pattern transformation grammar to transform the wanted records into well-formed ones.

Data Preparation: A Survey of Commercial Tools


Raw data are often messy: they follow different encodings, records are not well structured, values ​​do not adhere to patterns, etc. Such data are in general not fit to be ingested by downstream applications, such as data analytics tools, or even by data management systems. The act of obtaining information from raw data relies on some data preparation process. Data preparation is integral to advanced data analysis and data management, not only for data science but for any data-driven applications. Existing data preparation tools are operational and useful, but there is still room for improvement and optimization. With increasing data volume and its messy nature, the demand for prepared data increases day by day. To cater to this demand, companies and researchers are developing techniques and tools for data preparation.
To better understand the available data preparation systems, we have conducted a survey to investigate (1) prominent data preparation tools, (2) distinctive tool features, (3) the need for preliminary data processing even for these tools and, (4) features and abilities that are still lacking. We conclude with an argument in support of automatic and intelligent data preparation beyond traditional and simplistic techniques.

The Need For Data Preparation

Raw data appears in many situations: logs, sensor output, government data, medical research data, climate data, geospatial data, etc. It accumulates in many places, such as file systems, data lakes or online repositories. 
In typical scenarios, raw data from various sources is accrued without any standardized formats or structure and with no specific target use-case; thus, it can appear messy, contain invalid characters, use different encodings, lack necessary columns, contain unwanted rows, have missing values, not follow valid patterns, etc. 

One example scenario in need of data preparation technology are data lakes to store heterogeneous raw data. They can turn into vast repositories or collections of unstructured, semi-structured, unformulated, messy, and unclean data. The large volumes of data in data lakes are compelling and can generate valuable information, provided they are thoroughly pre-processed, cleaned, and prepared. With the ever-increasing amount of raw data, the need for data preparation has become more apparent.

Preparing data yields many advantages, such as prompt error detection, improved analytics, improved data quality, enhanced scalability, accelerated data usage and more easy data collaboration.

Data Preparation vs. Data Cleaning

We define data preparation as the set of preprocessing operations performed in early stages of a data processing pipeline, i.e., data transformationsat the structural and syntactical levels. In contrast, data cleaning concerns sub-sequent data transformations and corrections at the semantic level.

Data Preparation Tasks

Data preparation is not a single step process. Rather, it usually comprises many individual preparation steps, implemented by what we call preparators, and which we have organized anew into six broader categories, defined here.

Data discovery is the process of analyzing and collecting data from different sources, for instance to match data patterns, find missing data, and locate outliers.

Data validation comprises rules and constraints to inspect the data, for instance for correctness, completeness, and other data quality constraints.

Data structuring encompasses tasks for the creation, representation and structuring of information. Examples include updating schema, detecting and changing encoding and, transform data by example.

Data enrichment adds value or supplementary information to existing data from separate sources. Typically, it involves augmenting existing data with new or derived data values using data lookups, primary key generation, and inserting metadata.

Data filtering generates a subset of the data under consideration, facilitating manual inspection and removing irregular data rows or values. Examples include extracting text parts, and keeping or deleting filtered rows.

Data cleaning refers to removal, addition, or replacement of less accurate or inaccurate data values with more suitable, accurate or representative values. Typical examples are deduplication, fill missing values, and removing whitespace.

Despite our definition, which distinguishes data preparation and cleaning, we include data cleaning steps here as well, as most data preparation tools also venture into this area.

This Project Makes the Following Contributions:

1. Organisation: We propose six broad categories of data preparation and identify 40 common data preparation steps, which we classify into those categories.
2. Documentation: We validate the availability of these features and broader categories for seven selected tools and document them in a feature matrix.
3. Evaluation: We evaluate the selected features of surveyed tools to identify whether the tool offers the stated functionalities or not.
4. Recommendation: We identify shortcomings of commercial data preparation tools in general and encourage researchers to explore further in the field of data preparation.

Discovered Tools with Asserted Data Preparation Capabilities

We collected notable commercial data preparation tools gathered from business reports and analyses, company portals, and online demonstration videos. Our preliminary investigation resulted in 42 initial commercial tools , which we then examined for the extent of their data preparation capabilities.

Selected Data Preparation Tools

Criteria for Selected Tools:

1. Domain specificity: tools that specifically address the data preparation task.
2. Comprehensiveness: the extent and sophistication to which tools adequately covered preparation features.
3. Guides and documentation: the availability of proper documentation for the tools, i.e., useful, up-to-date documentation with listings of features and how-to guides.
4. Trial availability: the availability of a trial version, giving us the opportunity to test the tools and validate their features.
5. GUI: the availability of a comprehensive and intuitive graphical user interface to select and apply preparations.
6. Customer assistance: compliant support teams that assisted users with generic and specific tool queries, when needed.

Preparator Matrix

Table below provides a feature matrix showing which preparator is supported by which tool in each of the six categories.

Further Features of Data Preparation Tools

In our survey of commercial tools, we came across many functional and non-functional system features that did not cater to our data preparation focus. Nonetheless, these features are important and interesting to explore and use. Therefore, we have collected them as well.


  • G. Vitagliano, M. Hameed, F. Naumann: Structural embedding of data files with MaGRiTTE. Table Representation Learning Workshop at NeurIPS (TRL@NIPS), 2022.
  • G. Vitagliano, L. Reisener, L. Jiang, M. Hameed, F. Naumann: Mondrian: Spreadsheet Layout Detection. Proceedings of the International Conference on Management of Data (SIGMOD), 2022 
  • M. Hameed, G. Vitagliano, L. Jiang, F. Naumann: SURAGH: Syntactic Pattern Matching to Identify Ill-Formed Records. Proceedings of the International Conference on Extending Database Technology (EDBT), 2022 
  • L. Jiang, G. Vitagliano, M. Hameed, F. Naumann: Aggregation Detection in CSV Files. Proceedings of the International Conference on Extending Database Technology (EDBT), 2022 
  • M. Hameed, F. Naumann: Data Preparation: A Survey of Commercial Tools. SIGMOD Record 49:(3), 2020
    [Paper]  [ACM]