Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
  
 

Ioannis Koumarelas

I am a Ph.D. student in the Information Systems Research Group. My research started in collaboration with SAP and SAP Concur. During my Ph.D. I have worked in the general areas of Data Cleaning and Data Preparation, with my main focus on Duplicate Detection.

Hasso-Plattner-Institut
für Softwaresystemtechnik
Prof.-Dr.-Helmert-Straße 2-3
D-14482 Potsdam
Office: F-2.05, Campus II

Phone: +49 331 5509 1377
Email: Ioannis Koumarelas
Research: GoogleScholar, ResearchGate, DBLP
Profiles: LinkedIn, GitHub

Research Interests

  • Duplicate Detection (Record Linkage, Entity Resolution etc.), Data Cleaning, Data Preparation
  • Address Geocoding
  • Parallel and Distributed Systems, Big Data Management
  • Data Profiling
  • Data Mining, Machine Learning, Deep Learning

Projects

Cooperation project with SAP and SAP Concur on Vendor Data Cleaning for hotels. Our main task has been to apply Duplicate Detection, that is, to identify duplicates and understand their causes. The approaches we followed mainly use data preparation and matching dependencies; more information is available in our publications.

Publication list

Experience: Enhancing Address Matching with Geocoding and Similarity Measure Selection

Koumarelas, Ioannis; Kroschk, Axel; Mosley, Clifford; Naumann, Felix in Journal of Data and Information Quality (JDIQ), 2018.

Given a query record, record matching is the problem of finding database records that represent the same real-world object. In the easiest scenario, a database record is completely identical to the query. However, in most cases, problems do arise, for instance, as a result of data errors or data integrated from multiple sources or received from restrictive form fields. These problems are usually difficult, because they require a variety of actions, including field segmentation, decoding of values, and similarity comparisons, each requiring some domain knowledge. In this article, we study the problem of matching records that contain address information, including attributes such as Street-address and City. To facilitate this matching process, we propose a domain-specific procedure to, first, enrich each record with a more complete representation of the address information through geocoding and reverse-geocoding and, second, to select the best similarity measure per each address attribute that will finally help the classifier to achieve the best f-measure. We report on our experience in selecting geocoding services and discovering similarity measures for a concrete but common industry use-case.