de.hpi.fgis.dude.algorithm.duplicatedetection
Class Lego

java.lang.Object
  extended by de.hpi.fgis.dude.util.AbstractCleanable
      extended by de.hpi.fgis.dude.algorithm.AbstractAlgorithm
          extended by de.hpi.fgis.dude.algorithm.AbstractDuplicateDetection
              extended by de.hpi.fgis.dude.algorithm.duplicatedetection.Lego
All Implemented Interfaces:
Algorithm, Cleanable, AutoJsonable, Iterable<DuDeObjectPair>

public class Lego
extends AbstractDuplicateDetection

Lego is an iterative blocking approach. In addition to standard blocking, it distributes the duplicates it finds to other blocks, where they can help to find further duplicates. Those blocks are processed repeatedly until no new duplicates are found. The implementation is based on the original description of the algorithm in the paper "Entity Resolution with Iterative Blocking" by Steven Euijong Whang, David Menestrina, Georgia Koutrika, Martin Theobald and Hector Garcia-Molina (see http://ilpubs.stanford.edu:8090/915/).
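The idea described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the DuDe implementation: the class IterativeBlockingSketch, its Rec type, the "at least two shared attributes" match rule, and the sample data are all hypothetical, and the sketch simply revisits every block until a full pass produces no merge instead of tracking only the affected blocks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration of iterative blocking; not the DuDe implementation. */
final class IterativeBlockingSketch {

    /** A record or merged record: the ids it covers plus all attribute values seen. */
    static final class Rec {
        final Set<Integer> ids = new TreeSet<>();
        final Map<String, Set<String>> values = new HashMap<>();

        Rec with(String attribute, String value) {
            values.computeIfAbsent(attribute, k -> new HashSet<>()).add(value);
            return this;
        }
    }

    static Rec rec(int id) {
        Rec r = new Rec();
        r.ids.add(id);
        return r;
    }

    /** Merging keeps every id and every value of both inputs. */
    static Rec merge(Rec a, Rec b) {
        Rec m = new Rec();
        for (Rec r : Arrays.asList(a, b)) {
            m.ids.addAll(r.ids);
            r.values.forEach((attr, vals) -> vals.forEach(v -> m.with(attr, v)));
        }
        return m;
    }

    /** Toy match rule (assumption): the records agree on at least two attributes. */
    static boolean matches(Rec a, Rec b) {
        int shared = 0;
        for (Map.Entry<String, Set<String>> e : a.values.entrySet()) {
            Set<String> other = b.values.get(e.getKey());
            if (other != null && !Collections.disjoint(e.getValue(), other)) shared++;
        }
        return shared >= 2;
    }

    /** Merges the first matching pair in a block; returns true if a merge happened. */
    static boolean mergeFirstMatch(List<Integer> block, Map<Integer, Rec> current) {
        Set<Rec> distinct = new LinkedHashSet<>();
        for (int id : block) distinct.add(current.get(id));
        List<Rec> members = new ArrayList<>(distinct);
        for (int i = 0; i < members.size(); i++) {
            for (int j = i + 1; j < members.size(); j++) {
                if (matches(members.get(i), members.get(j))) {
                    Rec merged = merge(members.get(i), members.get(j));
                    // The id lookup makes the merged record visible in every other block.
                    for (int id : merged.ids) current.put(id, merged);
                    return true;
                }
            }
        }
        return false;
    }

    static Set<Set<Integer>> resolve(List<Rec> records, List<String> blockingAttributes) {
        Map<Integer, Rec> current = new HashMap<>();
        Map<String, List<Integer>> blocks = new HashMap<>();
        for (Rec r : records) {
            int id = r.ids.iterator().next();
            current.put(id, r);
            for (String attr : blockingAttributes) {
                for (String v : r.values.getOrDefault(attr, Collections.<String>emptySet())) {
                    blocks.computeIfAbsent(attr + "=" + v, k -> new ArrayList<>()).add(id);
                }
            }
        }
        boolean changed = true;
        while (changed) { // repeat until a full pass finds no new duplicates
            changed = false;
            for (List<Integer> block : blocks.values()) {
                while (mergeFirstMatch(block, current)) changed = true;
            }
        }
        Set<Set<Integer>> clusters = new HashSet<>();
        for (Rec r : current.values()) clusters.add(r.ids);
        return clusters;
    }

    static Set<Set<Integer>> demo() {
        List<Rec> records = Arrays.asList(
            rec(1).with("name", "doe").with("zip", "10115").with("mail", "jd@x.org"),
            rec(2).with("name", "doe").with("zip", "10115").with("tel", "555"),
            rec(3).with("name", "john-doe").with("mail", "jd@x.org").with("tel", "555"),
            rec(4).with("name", "smith").with("zip", "10115").with("mail", "s@y.org"));
        return resolve(records, Arrays.asList("name", "mail"));
    }
}
```

In this sample, records 1 and 3 share only the mail value, so a single pass over the mail block would not match them. Once records 1 and 2 are merged in the name block, the merged record also carries the tel value and matches record 3, so demo() yields the clusters {1, 2, 3} and {4}.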

Author:
Florian Thomas

Nested Class Summary
static class Lego.ComparisonResult
           
 class Lego.LegoIterator
           
 
Nested classes/interfaces inherited from class de.hpi.fgis.dude.algorithm.AbstractAlgorithm
AbstractAlgorithm.AlgorithmIteratorWrapper
 
Constructor Summary
protected Lego()
          For serialization.
  Lego(SortingKey... sortingKeys)
          Initializes Lego with the passed SortingKeys.
 
Method Summary
 void addDataSource(DataSource source)
          Adds a DataSource to the algorithm.
 void addSortingKey(SortingKey sortingKey)
          Adds a sorting key, which defines a blocking criterion
protected  Iterator<DuDeObjectPair> createIteratorInstance()
          Returns a new Iterator instance.
 Algorithm getCoreERAlgorithm()
          Returns the entity resolution algorithm that is used to process blocks internally
 Merger getMerger()
          Returns the merger that merges several DuDeObjects into one
protected  Lego.ComparisonResult getNotification()
          Returns the category that was set for the last processed pair.
 Set<SortingKey> getSortingKeys()
          Returns the blocking criteria
 void notifyOfLatestComparisonResult(Lego.ComparisonResult comparisonResult)
          Notifies the algorithm whether the latest object pair has been categorized as a duplicate or a non-duplicate
protected  void resetNotification()
          Resets the last notification
 void setCoreERAlgorithm(Algorithm coreERAlgorithm)
          Sets the entity resolution algorithm that is used to process blocks internally
 void setMerger(Merger merger)
          Sets the merger that merges several DuDeObjects into one
 void setSortingKeys(Set<SortingKey> sortingKeys)
          Defines the blocking criteria that shall be used for blocking
 
Methods inherited from class de.hpi.fgis.dude.algorithm.AbstractDuplicateDetection
addSource, dataSourceAttached, equals, getData, getDataSize, getMaximumPairCount, hashCode, iterator, preprocessData, unregisterDataSources
 
Methods inherited from class de.hpi.fgis.dude.algorithm.AbstractAlgorithm
addPreprocessor, addPreprocessor, analyzeDuDeObject, createStorage, dataExtracted, disableInMemoryProcessing, enableInMemoryProcessing, finishExtraction, finishPreprocessing, forceExtraction, getDataSize, getExtractedData, inMemoryProcessingEnabled
 
Methods inherited from class de.hpi.fgis.dude.util.AbstractCleanable
cleanUp, registerCleanable, registerCloseable
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface de.hpi.fgis.dude.util.Cleanable
cleanUp, registerCleanable, registerCloseable
 

Constructor Detail

Lego

protected Lego()
For serialization.


Lego

public Lego(SortingKey... sortingKeys)
Initializes Lego with the passed SortingKeys.

Parameters:
sortingKeys - The SortingKeys that are used for defining the blocks. All DuDeObjects having the same generated SortingKey will be included in one block.
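As a rough illustration of how several blocking criteria produce overlapping blocks, the following sketch groups plain strings by generated keys. It does not use the DuDe SortingKey class: the BlockingKeySketch class, the key functions, and the sample values are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

/** Hypothetical illustration of key-based blocking; not the DuDe SortingKey API. */
final class BlockingKeySketch {

    /** Builds one block per (criterion index, generated key) combination. */
    static Map<String, List<String>> buildBlocks(List<String> values,
                                                 List<Function<String, String>> keyGenerators) {
        Map<String, List<String>> blocks = new TreeMap<>();
        for (int k = 0; k < keyGenerators.size(); k++) {
            for (String value : values) {
                // Values that generate the same key under criterion k share a block.
                String key = k + ":" + keyGenerators.get(k).apply(value);
                blocks.computeIfAbsent(key, x -> new ArrayList<>()).add(value);
            }
        }
        return blocks;
    }

    static Map<String, List<String>> demo() {
        List<Function<String, String>> keys = Arrays.asList(
            v -> v.substring(0, 1).toLowerCase(), // criterion 0: first letter
            v -> String.valueOf(v.length()));     // criterion 1: length
        return buildBlocks(Arrays.asList("Anna", "Anne", "Bob", "Bert"), keys);
    }
}
```

Here "Bert" shares a block with "Bob" under the first criterion and with "Anna" and "Anne" under the second, which is the kind of overlap an iterative approach can exploit.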
Method Detail

createIteratorInstance

protected Iterator<DuDeObjectPair> createIteratorInstance()
Description copied from class: AbstractDuplicateDetection
Returns a new Iterator instance.

Specified by:
createIteratorInstance in class AbstractDuplicateDetection
Returns:
The Iterator instance.

addDataSource

public void addDataSource(DataSource source)
Adds a DataSource to the algorithm. Note: The data source must not contain merged objects; otherwise the Lego implementation will malfunction.

Specified by:
addDataSource in interface Algorithm
Overrides:
addDataSource in class AbstractAlgorithm
Parameters:
source - The DataSource that shall be added.
Throws:
NullPointerException - If null was passed.

setCoreERAlgorithm

public void setCoreERAlgorithm(Algorithm coreERAlgorithm)
Sets the entity resolution algorithm that is used to process blocks internally

Parameters:
coreERAlgorithm - The algorithm for processing blocks. Using Lego itself as the core entity resolution (CER) algorithm will not work.

getCoreERAlgorithm

public Algorithm getCoreERAlgorithm()
Returns the entity resolution algorithm that is used to process blocks internally

Returns:
The core entity resolution (CER) algorithm for processing blocks

setMerger

public void setMerger(Merger merger)
Sets the merger that merges several DuDeObjects into one

Parameters:
merger - The merger that merges two objects into one. The Lego implementation relies on the assumption that the merged object contains all identifiers of the merged objects. Currently, the DefaultMerger is confirmed to work as expected. Other implementations may behave differently.

getMerger

public Merger getMerger()
Returns the merger that merges several DuDeObjects into one

Returns:
The merger

addSortingKey

public void addSortingKey(SortingKey sortingKey)
Adds a sorting key, which defines a blocking criterion

Parameters:
sortingKey - The sortingKey that shall be added

setSortingKeys

public void setSortingKeys(Set<SortingKey> sortingKeys)
Defines the blocking criteria that shall be used for blocking

Parameters:
sortingKeys - A set of sorting keys that represent the blocking criteria

getSortingKeys

public Set<SortingKey> getSortingKeys()
Returns the blocking criteria

Returns:
A set of sorting keys that represent the blocking criteria

notifyOfLatestComparisonResult

public void notifyOfLatestComparisonResult(Lego.ComparisonResult comparisonResult)
Notifies the algorithm whether the latest object pair has been categorized as a duplicate or a non-duplicate

Parameters:
comparisonResult - The category.
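One plausible reading of this feedback mechanism is that the caller reports each comparison outcome before requesting the next pair. The sketch below shows that pattern in isolation and makes no claim about how Lego works internally: the NotificationSketch class, the Result constants DUPLICATE and NON_DUPLICATE, and the behavior of the methods are assumptions, with method names chosen only to mirror the ones documented here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical pair iterator that receives feedback from its caller. */
final class NotificationSketch implements Iterator<String[]> {

    /** Assumed categories; the real Lego.ComparisonResult constants may differ. */
    enum Result { DUPLICATE, NON_DUPLICATE }

    private final List<String> items;
    private final List<String[]> confirmed = new ArrayList<>();
    private int i = 0;
    private int j = 1;
    private String[] lastPair;
    private Result notification;

    NotificationSketch(List<String> items) {
        this.items = items;
    }

    /** Called by the consumer after it has classified the pair returned last. */
    void notifyOfLatestComparisonResult(Result result) {
        notification = result;
    }

    Result getNotification() {
        return notification;
    }

    void resetNotification() {
        notification = null;
    }

    /** Applies the pending feedback to the last pair, then clears it. */
    private void consumeNotification() {
        if (lastPair != null && getNotification() == Result.DUPLICATE) {
            confirmed.add(lastPair);
        }
        resetNotification();
        lastPair = null;
    }

    @Override
    public boolean hasNext() {
        return i < items.size() - 1;
    }

    @Override
    public String[] next() {
        if (!hasNext()) throw new NoSuchElementException();
        consumeNotification();
        lastPair = new String[] { items.get(i), items.get(j) };
        if (++j >= items.size()) { // advance to the next unordered pair
            i++;
            j = i + 1;
        }
        return lastPair;
    }

    List<String[]> confirmedDuplicates() {
        consumeNotification();
        return confirmed;
    }

    static int demo() {
        NotificationSketch pairs = new NotificationSketch(Arrays.asList("anna", "Anna", "bob"));
        while (pairs.hasNext()) {
            String[] pair = pairs.next();
            boolean same = pair[0].equalsIgnoreCase(pair[1]);
            pairs.notifyOfLatestComparisonResult(same ? Result.DUPLICATE : Result.NON_DUPLICATE);
        }
        return pairs.confirmedDuplicates().size();
    }
}
```

The design point is that the iterator cannot decide on its own whether a pair is a duplicate, so it defers acting on a pair until the caller has reported the category.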

resetNotification

protected void resetNotification()
Resets the last notification


getNotification

protected Lego.ComparisonResult getNotification()
Returns the category that was set for the last processed pair.

Returns:
The category of the last processed pair.


Copyright © 2011 Hasso Plattner Institute - Chair of Information Systems. All Rights Reserved.