Multi-Source Data Matching and Clustering

Erhard Rahm, Uni Leipzig

Abstract

In modern big data systems, data is often combined from multiple data sources, a process known as data integration. However, the combined data often contains duplicate entries that consume unnecessary storage and cause additional computational effort. Identifying these duplicates is known as entity resolution. It is a tedious task with high computational complexity, as it requires matching different records to the same real-world entity or object. A further challenge of entity resolution and clustering in big data systems is that data sources may be added continuously and new entities might be collected.

In this summary, we present different scalable approaches for identifying data that represents the same entity with acceptable computational complexity. We further give an overview of state-of-the-art tools that support entity resolution and clustering, and describe incremental entity clustering, an approach that handles changing data sources and avoids expensive recomputation.

Biography

Prof. Dr. Erhard Rahm graduated and earned his doctorate in Computer Science at TU Kaiserslautern. He is currently professor and head of the database group at the University of Leipzig, where he is also co-coordinator of the national Big Data Center ScaDS.AI (Center for Scalable Data Analytics and AI). ScaDS.AI conducts research in Big Data Analytics, AI and Machine Learning methods, and various application areas of Big Data and AI. It also has a strong focus on the ethical and societal dimensions of Big Data and AI. Rahm’s main fields of research include Data Quality and Data Integration. His work has received awards including a VLDB Ten Year Best Paper Award and the ICDE 2013 Influential Paper Award.

Lecture available on Teletask

Summary

Written by Alexandra Deutschenbaur & Rohan Sawahn

Data integration aims to combine data from different autonomous sources into a single view, thereby allowing uniform access to this data. Physical data integration materializes the combined data in a new dataset or database, where it can be accessed and analyzed. It is also a prerequisite for high-performance analysis: because the integration has happened beforehand, the data can be accessed quickly. This approach is most commonly used in data warehouses, knowledge graphs (which represent the knowledge from diverse sources in a new coherent way, e.g. DBpedia, YAGO, Google Knowledge Graph), and most Big Data applications. Integrating data from different sources is an expensive process that consists of multiple steps, as depicted in Figure 1. In this summary, we focus on the entity resolution step highlighted in Figure 1. Entity resolution is also commonly known as data matching or link discovery.

Figure 1: Data Integration Workflow (One possible order)

Challenges in Data Integration

Data integration currently faces many challenges. One of them is data quality: the data is heterogeneous and often unstructured or semi-structured. Additionally, there is a large need for data cleaning and enrichment. Furthermore, as data sources often change and new data is added, incremental or dynamic approaches for data integration are required.

Data volume is another significant issue. Performing all the steps for millions of entities in large data sources is very time-consuming. Therefore, entity resolution reduces the search space by blocking and relies on solutions that support parallel processing, for example on Hadoop clusters or GPUs.
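The effect of blocking can be illustrated with a minimal sketch. It assumes a very simple, hypothetical blocking key (the first three letters of a `name` field); real systems use more robust keys, and the record layout here is invented for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: first three letters of the lowercased name.
    return record["name"].lower()[:3]


def candidate_pairs(records):
    """Return only those id pairs whose records share a blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        # Comparisons happen inside a block, never across blocks.
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


records = [
    {"id": 1, "name": "Samsung Galaxy S10"},
    {"id": 2, "name": "samsung galaxy s-10"},
    {"id": 3, "name": "Apple iPhone 11"},
    {"id": 4, "name": "Apple iphone 11 64GB"},
]
pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} of {all_pairs} possible pairs remain: {sorted(pairs)}")
```

In this toy input, only two of the six possible pairs have to be compared. The price is that true matches with different blocking keys are missed, which is why the choice of key matters.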

Other challenges include preserving privacy when integrating data from different sources (e.g. in the health sector) and finding effective approaches that achieve high match quality, such as supervised machine learning methods.

Holistic Data Integration

A further challenge in real-world applications is the large number of data sources that have to be integrated, e.g. thousands of webshops or data lakes with thousands to millions of tables. A simple approach for integrating two data sources is pairwise linking, which returns a mapping between the two sources. However, pairwise linking does not scale well when the individual data sources change. Therefore, clusters are created from these pairwise matches. Each cluster has a cluster representative, the golden record, that represents the information in the cluster in a very compact way. With this more efficient approach, new entities no longer need to be compared with all entities from all sources, but only with the cluster representatives.
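The following sketch illustrates the idea of comparing a new entity only with cluster representatives. The token-based Jaccard similarity, the threshold of 0.5, and the representation of a golden record as a plain string are all assumptions made for this example, not details from the lecture.

```python
def jaccard(a, b):
    """Token-based Jaccard similarity between two strings (assumed measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def assign_to_cluster(entity, representatives, threshold=0.5):
    """Return the id of the most similar representative, or None if no
    representative reaches the threshold (the entity then starts a new cluster)."""
    best_id, best_sim = None, 0.0
    for cluster_id, golden_record in representatives.items():
        sim = jaccard(entity, golden_record)
        if sim > best_sim:
            best_id, best_sim = cluster_id, sim
    return best_id if best_sim >= threshold else None


# One golden record per cluster, regardless of how many entities it contains.
representatives = {
    "c1": "samsung galaxy s10",
    "c2": "apple iphone 11",
}
print(assign_to_cluster("Apple iPhone 11 black", representatives))
print(assign_to_cluster("Nokia 3310", representatives))
```

The number of comparisons per new entity grows with the number of clusters rather than with the total number of entities across all sources.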

Entity Resolution

In the following, we focus on the entity resolution and clustering steps of the data integration workflow. Entity resolution (ER), also known as data matching, is the task of identifying semantically equivalent objects from disparate data sources.

The general workflow of the ER task starts with one or more data sources given as input. The first step, blocking / filtering, reduces the number of comparisons. It is followed by the comparison of the remaining candidates for match classification. This is done by binary matchers that compare pairs of entities that are likely to match. In many cases, the workflow ends at this point. Otherwise, a final entity clustering step follows. Its basic approach is to determine the transitive closure, i.e. the connected components, of the match links. More generally, the final matches are determined by applying a clustering approach to the similarity graph formed by the candidate pairs. The final result is a set of clusters in which all entities are considered to match each other. In a special case, the sources are already duplicate-free (clean). With n such sources, a cluster can contain at most n entities, as it cannot contain more than one entity per clean source.
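The basic clustering step, computing connected components over the match links, can be sketched with a small union-find structure. The entity ids and links below are invented for illustration.

```python
def connected_components(nodes, links):
    """Group nodes into clusters that are connected via match links."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)
    return sorted(sorted(c) for c in clusters.values())


# Entities from three hypothetical sources a, b and c.
nodes = ["a1", "b1", "c1", "a2", "b2", "c3"]
links = [("a1", "b1"), ("b1", "c1"), ("a2", "b2")]
clusters = connected_components(nodes, links)
print(clusters)
```

Note that `a1` and `c1` end up in the same cluster although no direct match link connects them. This transitivity is exactly what makes connected components sensitive to a single wrong link.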

ER Tool Landscape

Multiple tools are available to aid the entity resolution process, such as Magellan, JedAI, and FAMER. Their feature sets differ: some support blocking, matching, and clustering and have a graphical user interface, whilst others only offer a subset of these features. Moreover, only FAMER supports incremental entity resolution and a non-commercial Big Data architecture (based on Apache Flink), thus making it viable for applications that need to scale.

FAMER stands for FAst Multi-source Entity Resolution system and is used for scalable linking and clustering for many sources. As shown in Figure 3, in the first linking step a similarity graph is built through blocking, pairwise computation of matches, and match classification. Then clustering is performed, resulting in the clustered set as output.

Entity Clustering for Clean and Mixed Sources (CLIP, MSCD-HAC)

Existing clustering methods often produce overlapping clusters. They can also produce source-inconsistent clusters, even when the sources are clean (duplicate-free). With clean sources, a cluster should not contain more than one entity per source. CLIP (CLustering based on Link Priority) is a new, effective clustering approach that works under the relatively restrictive assumption that all data sources are clean. It classifies every link in the similarity graph as strong (the link has maximum similarity from both ends), normal (it is maximal from one end but not from the other), or weak (it is not maximal from either end). CLIP exploits the duplicate-free assumption and ignores weak links, which avoids poor clustering decisions for connected components. The focus is on strong links, but normal ones are also considered.
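The link classification can be sketched as follows. This is an illustration of the strong / normal / weak distinction only, not the CLIP implementation. It assumes that "maximum similarity from one end" means the highest similarity among an entity's links into the source of the other endpoint; the text above does not spell out this detail. All entity ids and similarity values are invented.

```python
from collections import defaultdict


def classify_links(links, source_of):
    """Label each link (a, b, sim) as 'strong', 'normal' or 'weak'."""
    # best[(entity, other_source)]: highest similarity of any link from
    # `entity` into `other_source` (assumed definition of "maximum").
    best = defaultdict(float)
    for a, b, sim in links:
        best[(a, source_of[b])] = max(best[(a, source_of[b])], sim)
        best[(b, source_of[a])] = max(best[(b, source_of[a])], sim)

    labels = {}
    for a, b, sim in links:
        # Count the ends from which this link is the maximum (0, 1 or 2).
        max_ends = (sim == best[(a, source_of[b])]) + (sim == best[(b, source_of[a])])
        labels[(a, b)] = ("weak", "normal", "strong")[max_ends]
    return labels


source_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
links = [
    ("a1", "b1", 0.9),
    ("a1", "b2", 0.7),
    ("a2", "b1", 0.8),
    ("a2", "b2", 0.6),
]
labels = classify_links(links, source_of)
for link, label in labels.items():
    print(link, label)
```

Here the link between `a2` and `b2` is weak because each endpoint has a better partner in the other source, so an approach that ignores weak links would not merge the two.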

The assumption of only clean data sources is not realistic in many cases. To deal with both clean and dirty sources and achieve better match quality, the MSCD (Multi-Source Clean Dirty) approaches were developed. Two new MSCD approaches that are also supported in FAMER, and that do not deduplicate the data sources first, are based on Hierarchical Agglomerative Clustering (HAC) and Affinity Propagation (AP). MSCD-HAC is an iterative approach: Initially, each entity forms a cluster by itself. The most similar clusters are then merged iteratively as long as sufficiently similar clusters remain. Each merge must preserve source consistency, so that a cluster never contains more than one entity from a clean source. There are three approaches to determine cluster similarity: single linkage (only the best similarity value between entities of the two clusters counts, which is an optimistic approach), complete linkage (the minimum similarity counts, so every pair of entities across the two clusters must exceed the similarity threshold), and average linkage (the average similarity must be above the threshold).
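A simplified sketch of hierarchical agglomerative clustering with a source-consistency check is given below. It is not the FAMER implementation: the quadratic search for the best pair, the treatment of missing links as similarity 0, and all names and values are assumptions made to keep the example short.

```python
from itertools import combinations


def cluster_similarity(c1, c2, sim, linkage):
    """Similarity of two clusters; unlinked entity pairs count as 0.0."""
    values = [sim.get(frozenset((a, b)), 0.0) for a in c1 for b in c2]
    if linkage == "single":
        return max(values)
    if linkage == "complete":
        return min(values)
    return sum(values) / len(values)  # average linkage


def source_consistent(c1, c2, source_of, clean_sources):
    """A merge is allowed only if no clean source occurs in both clusters."""
    s1 = {source_of[e] for e in c1 if source_of[e] in clean_sources}
    s2 = {source_of[e] for e in c2 if source_of[e] in clean_sources}
    return not (s1 & s2)


def mscd_hac_sketch(entities, sim, source_of, clean_sources, threshold, linkage="average"):
    clusters = [frozenset([e]) for e in entities]  # every entity starts alone
    while True:
        best_pair, best_sim = None, threshold
        for c1, c2 in combinations(clusters, 2):
            if not source_consistent(c1, c2, source_of, clean_sources):
                continue
            s = cluster_similarity(c1, c2, sim, linkage)
            if s >= best_sim:
                best_pair, best_sim = (c1, c2), s
        if best_pair is None:  # no similar, mergeable clusters are left
            return clusters
        c1, c2 = best_pair
        clusters = [c for c in clusters if c not in best_pair] + [c1 | c2]


# Sources A and B are clean, source D is dirty (may contain duplicates).
source_of = {"a1": "A", "a2": "A", "b1": "B", "d1": "D", "d2": "D"}
sim = {
    frozenset(("a1", "a2")): 0.9,  # same clean source, must never be merged
    frozenset(("a1", "b1")): 0.9,
    frozenset(("a2", "b1")): 0.85,
    frozenset(("d1", "d2")): 0.95,
    frozenset(("a1", "d1")): 0.8,
    frozenset(("a1", "d2")): 0.8,
    frozenset(("b1", "d1")): 0.8,
    frozenset(("b1", "d2")): 0.8,
}
result = mscd_hac_sketch(list(source_of), sim, source_of, {"A", "B"}, threshold=0.7)
print([sorted(c) for c in result])
```

In this example, `a2` stays in its own cluster despite its high similarity to `a1` and `b1`, because merging would place two entities of the clean source A in one cluster. The two entities of the dirty source D, in contrast, may share a cluster.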

Incremental Entity Clustering and Repair

So far, we have only presented methods for multi-source entity resolution on static data sources. However, in real-world applications, new data often arrives constantly or data sources change, and this data needs to be integrated into the existing clustered entities. To provide scalability and efficient use of available resources, new data should not require recomputation of the existing clustering. Additionally, the order in which data arrives should not impact the result, which is especially a problem if wrong clustering decisions were made early. We therefore require a mechanism to repair existing clusters, rather than just adding incoming data to the most similar cluster.

FAMER is one of the few tools that natively support incremental clustering and repair. New entities that arrive are grouped amongst themselves and are linked to the existing entities. In the next step of the FAMER pipeline, the newly grouped entities are integrated into the clusters. We can distinguish between two methods to integrate the new groups of entities:

Non-repairing with Max-Both Merge (MBM): A new entity group is added to a similar, already existing cluster if there is an entity in the new group and an entity in the existing cluster whose link has maximum similarity from both ends, and if the respective clusters of this entity pair contain only entities from different sources. Otherwise, the new entity group forms a new cluster.
Repairing with n-Depth Reclustering (nDR): This approach reclusters existing clusters and can thereby repair them to achieve a better cluster assignment for new entities. However, reclustering everything for every new entity would not be efficient. Therefore, this approach only considers the n-depth neighbors that are linked in the similarity graph. A 1-depth reclustering only considers direct neighbors, a 2-depth reclustering also considers the neighbors of the direct neighbors, and so on.
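Selecting the n-depth neighbors can be pictured as a depth-limited breadth-first traversal of the similarity graph. The sketch below only shows this selection step, not the reclustering itself. It works on individual entities and uses a hypothetical adjacency-list graph; how FAMER represents the graph internally is not covered here.

```python
from collections import deque


def n_depth_neighbors(graph, start, n):
    """Return all nodes reachable from `start` via at most n links."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == n:
            continue  # do not expand beyond the requested depth
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    seen.discard(start)
    return seen


# Hypothetical similarity graph: "new" is a newly arrived entity.
graph = {
    "new": ["e1"],
    "e1": ["new", "e2"],
    "e2": ["e1", "e3"],
    "e3": ["e2"],
}
print(sorted(n_depth_neighbors(graph, "new", 1)))
print(sorted(n_depth_neighbors(graph, "new", 2)))
```

A larger n widens the part of the graph that is reclustered, trading runtime for more opportunities to repair earlier decisions.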

In an evaluation, these approaches were benchmarked against existing solutions and the 1-depth reclustering approach was able to achieve almost the same results as the non-incremental approach, while at the same time being much faster. With this evaluation, it was shown that the quality of nDR does not depend on the order in which the entries arrive.

In summary, with the rise of big data systems, the need for multi-source entity resolution is growing. Data integration still faces many challenges such as data quality, privacy, support for continuous change, and automation of large-scale knowledge graphs.

FAMER and the incremental clustering approach are a huge step forward. Nevertheless, numerous challenges are still open, such as incremental entity resolution for many entity types or learning-based classification of new entities. So far, we have only looked at entity resolution of textual data. However, data often also consists of images or audio, which calls for further research in the field of multimodal entity resolution.

Sources

Famer Incremental Pipeline: Saeedi, Alieh. "Clustering Approaches for Multi-source Entity Resolution."

Figure 1: Slides Lecture by Prof. Erhard Rahm, https://www.tele-task.de/lecture/video/9028/

Figure 2: Slides Lecture by Prof. Erhard Rahm, https://www.tele-task.de/lecture/video/9028/

Figure 3: Slides Lecture by Prof. Erhard Rahm, https://www.tele-task.de/lecture/video/9028/

Figure 4: Famer Incremental Pipeline: Saeedi, Alieh. "Clustering Approaches for Multi-source Entity Resolution."
