Information Systems Group
Hasso Plattner Institute
Office: Prof.-Dr.-Helmert-Straße 2-3, D-14482 Potsdam, Room: F-2.05
Tel.: +49 331 5509 274
Supervisor: Prof. Dr. Felix Naumann
For virtually any data-driven application, data loading is the initial step; even data cleaning and preparation require the data to be loaded successfully first. However, most data collections are prone to errors that can in turn hinder data loading. These issues generally stem from pre-existing errors in data files, such as (1) empty or null values, (2) out-of-range values, (3) inconsistent column data, and (4) misplaced delimiters, to name a few. To address data-loading issues in data stream applications, we propose a methodology that utilizes data patterns to identify irregularities by finding (1) common or frequent data patterns and (2) irregular or unique data patterns. The reasoning is that for a record or column to be identified as “inconsistent” or “out of form” with respect to the rest of the data, its pattern must differ from the rest of the record patterns or stand alone within its group of patterns.
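The pattern-based idea above can be sketched as follows. This is a minimal illustration, not the proposed method itself: it assumes a simple character-class abstraction (digits become `D`, letters become `A`, other characters are kept) and flags patterns whose relative frequency falls below a hypothetical threshold.

```python
from collections import Counter

def pattern(value: str) -> str:
    """Abstract a cell value into a character-class pattern:
    digits -> 'D', letters -> 'A', other characters kept as-is.
    Consecutive runs of the same class are collapsed."""
    out = []
    for ch in value:
        cls = "D" if ch.isdigit() else "A" if ch.isalpha() else ch
        if not out or out[-1] != cls:
            out.append(cls)
    return "".join(out)

def rare_patterns(values, threshold=0.05):
    """Return the set of patterns whose relative frequency is below
    `threshold` -- candidates for irregular or unique data patterns."""
    counts = Counter(pattern(v) for v in values)
    total = sum(counts.values())
    return {p for p, c in counts.items() if c / total < threshold}

# Example: three numeric values share one pattern, one value stands out.
print(pattern("2021-01-01"))                                   # D-D-D
print(rare_patterns(["12.5", "13.0", "9.75", "oops"], 0.3))    # {'A'}
```

The threshold and the character-class alphabet are free parameters; a real implementation would likely learn or tune them per column.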
Normally, records that do not conform to the rest of the data are one of the contributing factors to data loading issues. In the context of our research, we refer to such records as “messy records”. For a record to be regarded as messy, it is not necessary that all of its columns and cell values be irregular; a single value that is out of order and context can suffice to make the record stand out as an outlier from the rest of the data.
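The "single irregular value suffices" criterion can be illustrated with a column-wise check: a row is flagged as messy if any one of its cells carries a pattern that is rare within its own column. The pattern abstraction and the frequency threshold here are illustrative assumptions, not the thesis's actual algorithm.

```python
from collections import Counter

def cell_pattern(value: str) -> str:
    # digits -> 'D', letters -> 'A', other chars kept; runs collapsed
    out = []
    for ch in value:
        cls = "D" if ch.isdigit() else "A" if ch.isalpha() else ch
        if not out or out[-1] != cls:
            out.append(cls)
    return "".join(out)

def messy_rows(rows, threshold=0.25):
    """Return indices of rows where at least one cell's pattern is
    rare within its own column (relative frequency below `threshold`)."""
    n_cols = len(rows[0])
    rare_per_col = []
    for c in range(n_cols):
        counts = Counter(cell_pattern(r[c]) for r in rows)
        total = len(rows)
        rare_per_col.append({p for p, k in counts.items() if k / total < threshold})
    return [i for i, r in enumerate(rows)
            if any(cell_pattern(r[c]) in rare_per_col[c] for c in range(n_cols))]

# Row 3 is flagged: its first cell is alphabetic in an otherwise numeric column,
# even though its second cell looks perfectly ordinary.
rows = [["1", "ab"], ["2", "cd"], ["3", "ef"], ["x", "gh"], ["5", "ij"]]
print(messy_rows(rows))  # [3]
```

Note that the row is detected without any schema or type information: only the distribution of patterns inside the file itself is consulted, which matches the external-information-free setting described below.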
Our research aims to identify and handle such messy records by proposing a technique that does not rely on external information such as field data types, record structure, or file dialect.