java.lang.Object
  de.hpi.fgis.dude.util.AbstractCleanable
    de.hpi.fgis.dude.algorithm.AbstractAlgorithm
      de.hpi.fgis.dude.algorithm.AbstractDuplicateDetection
        de.hpi.fgis.dude.algorithm.duplicatedetection.Lego
public class Lego
Lego is an iterative blocking approach. It extends the standard blocking algorithm:
duplicates found in one block are distributed to the other blocks to find more duplicates.
Those blocks are processed repeatedly until no new duplicates are found.
The implementation is based on the original description of the algorithm in the paper
"Entity Resolution with Iterative Blocking" by Steven Euijong Whang, David Menestrina, Georgia Koutrika,
Martin Theobald and Hector Garcia-Molina (see http://ilpubs.stanford.edu:8090/915/).
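The loop described above can be illustrated with a small, self-contained sketch. It is not the DuDe implementation: `Rec`, `resolve`, `mergeFirstDuplicate`, `replaceParts` and the "share at least two values" match rule are hypothetical stand-ins chosen for illustration. The example shows a merged record being pushed into a second block, where it exposes a duplicate that its parts alone would not match.

```java
import java.util.*;
import java.util.function.BiPredicate;

// Hypothetical stand-in types; nothing here is part of the DuDe API.
final class IterativeBlockingSketch {

    /** A (possibly merged) record: the ids it was built from and the union of their values. */
    static final class Rec {
        final Set<Integer> ids = new TreeSet<>();
        final Set<String> values = new TreeSet<>();

        static Rec of(int id, String... values) {
            Rec r = new Rec();
            r.ids.add(id);
            r.values.addAll(Arrays.asList(values));
            return r;
        }

        Rec merge(Rec other) {
            Rec m = new Rec();
            m.ids.addAll(ids);
            m.ids.addAll(other.ids);
            m.values.addAll(values);
            m.values.addAll(other.values);
            return m;
        }
    }

    /** Processes blocks until no block yields a new duplicate; returns the resulting id clusters. */
    static List<Set<Integer>> resolve(List<List<Rec>> blocks, BiPredicate<Rec, Rec> isDuplicate) {
        Deque<Integer> queue = new ArrayDeque<>();
        for (int b = 0; b < blocks.size(); b++) {
            queue.add(b);
        }
        while (!queue.isEmpty()) {
            Rec merged = mergeFirstDuplicate(blocks.get(queue.poll()), isDuplicate);
            if (merged == null) {
                continue; // no duplicate in this block right now
            }
            // Distribute the merged record to every block that holds one of its parts
            // and schedule those blocks for another pass.
            for (int o = 0; o < blocks.size(); o++) {
                if (replaceParts(blocks.get(o), merged) && !queue.contains(o)) {
                    queue.add(o);
                }
            }
        }
        Set<Set<Integer>> clusters = new LinkedHashSet<>();
        for (List<Rec> block : blocks) {
            for (Rec r : block) {
                clusters.add(r.ids);
            }
        }
        return new ArrayList<>(clusters);
    }

    /** Returns the merge of the first duplicate pair in the block, or null if there is none. */
    static Rec mergeFirstDuplicate(List<Rec> block, BiPredicate<Rec, Rec> isDuplicate) {
        for (int i = 0; i < block.size(); i++) {
            for (int j = i + 1; j < block.size(); j++) {
                if (isDuplicate.test(block.get(i), block.get(j))) {
                    return block.get(i).merge(block.get(j));
                }
            }
        }
        return null;
    }

    /** Replaces every record that is a part of the merged record; returns true if the block changed. */
    static boolean replaceParts(List<Rec> block, Rec merged) {
        boolean removed = block.removeIf(r -> merged.ids.containsAll(r.ids) && !merged.ids.equals(r.ids));
        boolean present = block.stream().anyMatch(r -> r.ids.equals(merged.ids));
        if (removed && !present) {
            block.add(merged);
        }
        return removed;
    }

    static List<Set<Integer>> demo() {
        Rec r1 = Rec.of(1, "ann lee", "a@x.org", "berlin");
        Rec r2 = Rec.of(2, "ann lee", "a@x.org", "555");
        Rec r3 = Rec.of(3, "a. lee", "berlin", "555");
        Rec r4 = Rec.of(4, "dora", "paris", "555");
        List<List<Rec>> blocks = new ArrayList<>();
        blocks.add(new ArrayList<>(Arrays.asList(r1, r2)));     // block by name
        blocks.add(new ArrayList<>(Arrays.asList(r2, r3, r4))); // block by phone
        // Toy match rule: two records are duplicates if they share at least two values.
        return resolve(blocks, (a, b) -> {
            Set<String> common = new HashSet<>(a.values);
            common.retainAll(b.values);
            return common.size() >= 2;
        });
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [[1, 2, 3], [4]]
    }
}
```

With plain blocking, records 2 and 3 share only one value and stay apart. Once records 1 and 2 are merged and the result replaces record 2 in the phone block, the merged record shares two values with record 3, so a second pass over that block finds the additional duplicate.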
Nested Class Summary | |
---|---|
static class | Lego.ComparisonResult |
class | Lego.LegoIterator |
Nested classes/interfaces inherited from class de.hpi.fgis.dude.algorithm.AbstractAlgorithm |
---|
AbstractAlgorithm.AlgorithmIteratorWrapper |
Constructor Summary | |
---|---|
protected | Lego() For serialization. |
public | Lego(SortingKey... sortingKeys) Initializes Lego with the passed SortingKeys. |
Method Summary | |
---|---|
void | addDataSource(DataSource source) Adds a DataSource to the algorithm. |
void | addSortingKey(SortingKey sortingKey) Adds a sorting key, which defines a blocking criterion. |
protected Iterator<DuDeObjectPair> | createIteratorInstance() Returns a new Iterator instance. |
Algorithm | getCoreERAlgorithm() Returns the entity resolution algorithm that is used to process blocks internally. |
Merger | getMerger() Returns the merger that merges several DuDeObjects into one. |
protected Lego.ComparisonResult | getNotification() Returns the category that was set for the last processed pair. |
Set<SortingKey> | getSortingKeys() Returns the blocking criteria. |
void | notifyOfLatestComparisonResult(Lego.ComparisonResult comparisonResult) Notifies the algorithm whether the latest object pair has been categorized as a duplicate or a non-duplicate. |
protected void | resetNotification() Resets the last notification. |
void | setCoreERAlgorithm(Algorithm coreERAlgorithm) Sets the entity resolution algorithm that is used to process blocks internally. |
void | setMerger(Merger merger) Sets the merger that merges several DuDeObjects into one. |
void | setSortingKeys(Set<SortingKey> sortingKeys) Defines the blocking criteria that shall be used for blocking. |
Methods inherited from class de.hpi.fgis.dude.algorithm.AbstractDuplicateDetection |
---|
addSource, dataSourceAttached, equals, getData, getDataSize, getMaximumPairCount, hashCode, iterator, preprocessData, unregisterDataSources |
Methods inherited from class de.hpi.fgis.dude.algorithm.AbstractAlgorithm |
---|
addPreprocessor, addPreprocessor, analyzeDuDeObject, createStorage, dataExtracted, disableInMemoryProcessing, enableInMemoryProcessing, finishExtraction, finishPreprocessing, forceExtraction, getDataSize, getExtractedData, inMemoryProcessingEnabled |
Methods inherited from class de.hpi.fgis.dude.util.AbstractCleanable |
---|
cleanUp, registerCleanable, registerCloseable |
Methods inherited from class java.lang.Object |
---|
clone, finalize, getClass, notify, notifyAll, toString, wait, wait, wait |
Methods inherited from interface de.hpi.fgis.dude.util.Cleanable |
---|
cleanUp, registerCleanable, registerCloseable |
Constructor Detail |
---|
protected Lego()
For serialization.

public Lego(SortingKey... sortingKeys)
Initializes Lego with the passed SortingKeys.
Parameters:
sortingKeys - The SortingKeys that are used for defining the blocks. All DuDeObjects having the same generated SortingKey will be included in one block.

Method Detail |
---|
protected Iterator<DuDeObjectPair> createIteratorInstance()
Description copied from class: AbstractDuplicateDetection
Returns a new Iterator instance.
Declared as createIteratorInstance in class AbstractDuplicateDetection.
Returns:
A new Iterator instance.

public void addDataSource(DataSource source)
Adds a DataSource to the algorithm. Note: The data source must not contain merged objects; otherwise the Lego implementation will malfunction.
Specified by:
addDataSource in interface Algorithm
Also declared as addDataSource in class AbstractAlgorithm.
Parameters:
source - The DataSource that shall be added.
Throws:
NullPointerException - If null was passed.

public void setCoreERAlgorithm(Algorithm coreERAlgorithm)
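Sets the entity resolution algorithm that is used to process blocks internally.
Parameters:
coreERAlgorithm - The algorithm for processing blocks. Using Lego itself as the core ER algorithm will not work.

public Algorithm getCoreERAlgorithm()
Returns the entity resolution algorithm that is used to process blocks internally.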

public void setMerger(Merger merger)
Sets the merger that merges several DuDeObjects into one.
Parameters:
merger - The merger that merges two objects into one. The Lego implementation relies on the assumption that the merged object contains all identifiers of the merged objects. Currently, the DefaultMerger is confirmed to work in the expected way. Other implementations may operate differently.

public Merger getMerger()
Returns the merger that merges several DuDeObjects into one.
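
The identifier assumption stated for setMerger can be made concrete with a short sketch. The `Obj` type and the `merge` and `preservesIdentifiers` helpers below are hypothetical stand-ins, not the DuDe `Merger` or `DuDeObject` types; the conflict rule for attribute values is likewise an assumption. The point is only the invariant: a merged object must still carry every identifier of its sources, and a check like `preservesIdentifiers` is one way a custom merger could be tested against it.

```java
import java.util.*;

// Hypothetical stand-in types; the real Merger and DuDeObject interfaces are not shown here.
final class IdPreservingMergeSketch {

    static final class Obj {
        final Set<String> ids = new TreeSet<>();
        final Map<String, String> attributes = new TreeMap<>();

        Obj(String id) {
            ids.add(id);
        }

        private Obj() {
        }
    }

    /** Merges two objects; the result carries every identifier of both inputs. */
    static Obj merge(Obj a, Obj b) {
        Obj m = new Obj();
        m.ids.addAll(a.ids);
        m.ids.addAll(b.ids);
        m.attributes.putAll(b.attributes);
        m.attributes.putAll(a.attributes); // assumed rule: on conflict, the first object's value wins
        return m;
    }

    /** Checks the property the text relies on: no source identifier is lost. */
    static boolean preservesIdentifiers(Obj merged, Obj... sources) {
        for (Obj s : sources) {
            if (!merged.ids.containsAll(s.ids)) {
                return false;
            }
        }
        return true;
    }

    static Obj demo() {
        Obj a = new Obj("src1#17");
        a.attributes.put("name", "Ann");
        Obj b = new Obj("src2#4");
        b.attributes.put("name", "Anne");
        b.attributes.put("city", "Berlin");
        Obj merged = merge(a, b);
        if (!preservesIdentifiers(merged, a, b)) {
            throw new IllegalStateException("merger dropped an identifier");
        }
        return merged;
    }

    public static void main(String[] args) {
        Obj merged = demo();
        System.out.println(merged.ids + " " + merged.attributes);
    }
}
```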
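
public void addSortingKey(SortingKey sortingKey)
Adds a sorting key, which defines a blocking criterion.
Parameters:
sortingKey - The sorting key that shall be added.

public void setSortingKeys(Set<SortingKey> sortingKeys)
Defines the blocking criteria that shall be used for blocking.
Parameters:
sortingKeys - A set of sorting keys that represent the blocking criteria.

public Set<SortingKey> getSortingKeys()
Returns the blocking criteria.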
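
How several sorting keys translate into blocks can be sketched as follows. This is a minimal illustration under assumptions, not the DuDe `SortingKey` API: each key is modeled as a plain `Function<T, String>`, and the block id format `index:value` is invented for the example. Each key acts as one blocking criterion, and all records that generate the same key value under a criterion land in the same block.

```java
import java.util.*;
import java.util.function.Function;

// Hypothetical stand-in for SortingKey-based blocking; not the DuDe API.
final class BlockingKeySketch {

    /** Groups records into blocks: one block per (criterion index, generated key value). */
    static <T> Map<String, List<T>> buildBlocks(List<T> records, List<Function<T, String>> keys) {
        Map<String, List<T>> blocks = new TreeMap<>();
        for (int k = 0; k < keys.size(); k++) {
            for (T record : records) {
                String blockId = k + ":" + keys.get(k).apply(record);
                blocks.computeIfAbsent(blockId, id -> new ArrayList<>()).add(record);
            }
        }
        return blocks;
    }

    static Map<String, List<String>> demo() {
        List<String> records = Arrays.asList(
                "Anna Smith|10115", "Anne Smith|10117", "Bob Stone|10115");
        List<Function<String, String>> keys = new ArrayList<>();
        keys.add(r -> r.substring(0, 3).toLowerCase(Locale.ROOT)); // criterion 0: name prefix
        keys.add(r -> r.substring(r.indexOf('|') + 1));            // criterion 1: zip code
        return buildBlocks(records, keys);
    }

    public static void main(String[] args) {
        demo().forEach((id, block) -> System.out.println(id + " -> " + block));
    }
}
```

Because every record appears once per criterion, the same record can sit in several blocks, which is what allows a duplicate found under one criterion to be passed on to the blocks of another.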
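
public void notifyOfLatestComparisonResult(Lego.ComparisonResult comparisonResult)
Notifies the algorithm whether the latest object pair has been categorized as a duplicate or a non-duplicate.
Parameters:
comparisonResult - The category.

protected void resetNotification()
Resets the last notification.

protected Lego.ComparisonResult getNotification()
Returns the category that was set for the last processed pair.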
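
The three notification methods suggest a feedback protocol between the caller and the algorithm: the caller classifies each returned pair and reports the category before asking for the next pair. The sketch below shows one way such a protocol could work. It is an assumption-laden stand-in, not the Lego implementation: the `Result` constants, the `String[]` pairs and the `confirmed` list are hypothetical, and how the real class reacts to a notification is not described on this page.

```java
import java.util.*;

// Hypothetical stand-in; it only mirrors the three method names listed above.
final class NotificationSketch {

    // Assumed categories; the actual constants of Lego.ComparisonResult are not listed on this page.
    enum Result { DUPLICATE, NON_DUPLICATE }

    private final Iterator<String[]> candidates;
    private final List<String[]> confirmed = new ArrayList<>();
    private String[] lastPair;
    private Result latest;

    NotificationSketch(List<String[]> candidatePairs) {
        this.candidates = candidatePairs.iterator();
    }

    /** Called by the client after it has classified the pair returned by next(). */
    void notifyOfLatestComparisonResult(Result result) {
        latest = Objects.requireNonNull(result);
    }

    Result getNotification() {
        return latest;
    }

    void resetNotification() {
        latest = null;
    }

    boolean hasNext() {
        consumeNotification();
        return candidates.hasNext();
    }

    String[] next() {
        consumeNotification();
        lastPair = candidates.next();
        return lastPair;
    }

    /** Reads the pending category for the last pair, acts on it, then clears it. */
    private void consumeNotification() {
        if (getNotification() == Result.DUPLICATE) {
            confirmed.add(lastPair);
        }
        resetNotification();
    }

    static int demo() {
        List<String[]> pairs = Arrays.asList(
                new String[] {"Ann", "ann"},
                new String[] {"Ann", "Bob"},
                new String[] {"bob", "BOB"});
        NotificationSketch algorithm = new NotificationSketch(pairs);
        while (algorithm.hasNext()) {
            String[] pair = algorithm.next();
            boolean same = pair[0].equalsIgnoreCase(pair[1]);
            algorithm.notifyOfLatestComparisonResult(same ? Result.DUPLICATE : Result.NON_DUPLICATE);
        }
        return algorithm.confirmed.size();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 2
    }
}
```

Resetting the notification after it has been read keeps a category from being applied twice when `hasNext()` and `next()` are called back to back.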