de.hpi.fgis.dude.postprocessor
Class ExtendedStatisticComponent

java.lang.Object
  extended by de.hpi.fgis.dude.postprocessor.StatisticComponent
      extended by de.hpi.fgis.dude.postprocessor.ExtendedStatisticComponent
All Implemented Interfaces:
AutoJsonable

public class ExtendedStatisticComponent
extends StatisticComponent
implements AutoJsonable

ExtendedStatisticComponent gathers statistics for the measures that can be expressed with the Generalized Merge Distance (GMD). Currently supported are Precision, Recall, F1, Variation of Information, Basic Merge Distance, and a hybrid measure. To compute them, a collection of real duplicates has to be added. For the computation, the GMD clusters the duplicate pairs and thereby forms the transitive closure of the duplicates. The GMD results for Precision, Recall, and F1 therefore match those of the StatisticComponent only if the set of detected duplicate pairs is COMPLETE, that is, already transitively closed. Otherwise they will DIFFER, because the StatisticComponent bases its results on the actual number of duplicate pairs detected.
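
The effect of clustering on pair counts can be illustrated with a small sketch. This is not DuDe code: plain strings stand in for DuDeObjects, and the class name TransitiveClosureSketch and its methods are hypothetical. It groups detected pairs with a union-find structure and counts the pairs implied by the resulting clusters.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not part of DuDe: record ids are plain strings.
final class TransitiveClosureSketch {
    private final Map<String, String> parent = new HashMap<>();

    // Returns the representative of the cluster that contains id.
    String find(String id) {
        parent.putIfAbsent(id, id);
        String root = id;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        parent.put(id, root); // shorten the path for later lookups
        return root;
    }

    // Merges the clusters of the two records of a detected duplicate pair.
    void addPair(String a, String b) {
        parent.put(find(a), find(b));
    }

    // Groups all ids seen so far by their representative.
    List<List<String>> clusters() {
        Map<String, List<String>> byRoot = new HashMap<>();
        for (String id : new ArrayList<>(parent.keySet())) {
            byRoot.computeIfAbsent(find(id), k -> new ArrayList<>()).add(id);
        }
        return new ArrayList<>(byRoot.values());
    }

    // Number of pairs implied by the clusters: n * (n - 1) / 2 per cluster.
    static long impliedPairs(List<List<String>> clusters) {
        long pairs = 0;
        for (List<String> c : clusters) {
            pairs += (long) c.size() * (c.size() - 1) / 2;
        }
        return pairs;
    }
}
```

If only the pairs (a, b) and (b, c) are added, the single resulting cluster {a, b, c} implies three pairs, including (a, c), which was never reported. A pair-based count would see two detected pairs, a cluster-based count three.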

Author:
Cindy Faehnrich

Nested Class Summary
static class ExtendedStatisticComponent.Config
           
 
Field Summary
 Collection<DuDeObjectPair> duplicates
          Collection of all duplicate pairs of the comparisons.
 Vector<Vector<DuDeObject>> resultCluster
          Final ER result in cluster format.
 
Fields inherited from class de.hpi.fgis.dude.postprocessor.StatisticComponent
actualComparisonCount, algorithm, checkMemory, endDate, endTime, falseNegativesByComparison, falsePositives, falsePositivesByComparison, goldStandard, memoryCheckFrequency, NO_STATISTIC_VALUE, pairCount, startDate, startTime, task, timer, trueNegativesByComparison, truePositives, truePositivesByComparison
 
Constructor Summary
protected ExtendedStatisticComponent()
          Internal constructor for Jsonable deserialization.
  ExtendedStatisticComponent(Algorithm algorithm)
          Initializes an ExtendedStatisticComponent with no gold standard and default configuration for GMD.
  ExtendedStatisticComponent(GoldStandard goldStandard, Algorithm algorithm, ExtendedStatisticComponent.Config config)
          Initializes an ExtendedStatisticComponent using the passed DuDeObjectPairs as real duplicates.
 
Method Summary
 void activateDefault()
          Activates the usage of the Default GMD as evaluation measure.
 void activateF1()
          Activates the usage of PairwiseF1 (and by that Precision and Recall) as GMD evaluation measure.
 void activateHybrid()
          Activates the usage of Hybrid GMD as evaluation measure.
 void activatePRECISION()
          Activates the usage of Precision as GMD evaluation measure.
 void activateRECALL()
          Activates the usage of Recall as GMD evaluation measure.
 void activateVI()
          Activates the usage of the Variation of Information as GMD evaluation measure.
 void addDuplicate(DuDeObjectPair pair, boolean actualComparison)
          Adds a DuDeObjectPair to the knowledge base that is labeled as a detected duplicate and the gold standard's duplicate pairs.
 double computeF1(double precision, double recall)
          Computation of the f1 measure by calculating the harmonic mean of the given precision and recall.
 double computeGMD(Collection<? extends Collection<DuDeObject>> erCluster, Collection<? extends Collection<DuDeObject>> goldCluster)
          Returns the Generalized Merge Distance based on the current cost configurations.
 Vector<Vector<DuDeObject>> computeSingleClusters(Vector<Vector<DuDeObject>> clusteredRecords)
          Returns a list of clusters with each record alone in its cluster.
 void createResultCluster()
          Creates a Vector of Vector of DuDeObjects from the given data structure (Transform gold standard from duplicate pairs to cluster format).
 void deactivateDefault()
          Deactivates the usage of the Default GMD as evaluation measure.
 void deactivateF1()
          Deactivates the usage of PairwiseF1 as GMD evaluation measure.
 void deactivateHybrid()
          Deactivates the usage of Hybrid GMD as evaluation measure.
 void deactivatePRECISION()
          Deactivates the usage of Precision as GMD evaluation measure.
 void deactivateRECALL()
          Deactivates the usage of Recall as GMD evaluation measure.
 void deactivateVI()
          Deactivates the usage of the Variation of Information as GMD evaluation measure.
 int findCluster(HashMap<DuDeObject,Integer> m, DuDeObject item)
          Searches a hash map for the value of a given key by comparing ids (source id and object id).
 Vector<Vector<DuDeObject>> generateClustersForUnique(Vector<Vector<DuDeObject>> clusters)
          Generates for each unique item (that is not listed in the pairwise er result) its own cluster and adds it to the given clusters.
 ExtendedStatisticComponent.Config getConfig()
          Returns the current cost configuration set for GMD processing.
 boolean hasGMD()
          Checks whether this StatisticComponent calculates the Generalized Merge Distance.
 double normalize(double val1, double val2)
          Returns the first given value normalized to 1, taking the second value as the maximum.
 double processGMD()
          Processes the Generalized Merge Distance regarding the current configuration if it is activated.
 void setConfig(ExtendedStatisticComponent.Config config)
          Sets the cost configuration to use for GMD processing.
 void setExtractedData()
          Sets the size of the extracted data.
 
Methods inherited from class de.hpi.fgis.dude.postprocessor.StatisticComponent
addDuplicate, addDuplicate, addNonDuplicate, addNonDuplicate, addNonDuplicate, addPair, addPair, checkMemoryUsage, getAverageMemoryUsed, getComparisonCount, getEndDate, getFalseNegatives, getFalseNegativesByComparison, getFalsePositives, getFalsePositivesByComparison, getFMeasure, getFMeasureByComparison, getMaximumMemoryUsed, getMemoryCheckFrequency, getMinimumMemoryUsed, getNumberOfCandidateComparisons, getNumberOfRealDuplicates, getObjectCount, getPairCount, getPrecision, getPrecisionByComparison, getRecall, getRecallByComparison, getReductionRatio, getReductionRatioByComparison, getRuntime, getStartDate, getTrueNegatives, getTrueNegativesByComparison, getTruePositives, getTruePositivesByComparison, goldStandardSet, isCheckMemory, isDuplicate, isNonDuplicate, reset, setBeginningTime, setCheckMemory, setEndTime, setFinishingTime, setMemoryCheckFrequency, setStartTime
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

resultCluster

public Vector<Vector<DuDeObject>> resultCluster
Final ER result in cluster format.


duplicates

public final transient Collection<DuDeObjectPair> duplicates
Collection of all duplicate pairs of the comparisons.

Constructor Detail

ExtendedStatisticComponent

public ExtendedStatisticComponent(Algorithm algorithm)
Initializes an ExtendedStatisticComponent with no gold standard and default configuration for GMD.

Parameters:
algorithm - The used algorithm.

ExtendedStatisticComponent

public ExtendedStatisticComponent(GoldStandard goldStandard,
                                  Algorithm algorithm,
                                  ExtendedStatisticComponent.Config config)
Initializes an ExtendedStatisticComponent using the passed DuDeObjectPairs as real duplicates.

Parameters:
goldStandard - The gold standard which these statistics are based on.
algorithm - Used algorithm.
config - Used configuration of GMD.

ExtendedStatisticComponent

protected ExtendedStatisticComponent()
Internal constructor for Jsonable deserialization.

Method Detail

activateF1

public void activateF1()
Activates the usage of PairwiseF1 (and by that Precision and Recall) as GMD evaluation measure.


deactivateF1

public void deactivateF1()
Deactivates the usage of PairwiseF1 as GMD evaluation measure.


activateRECALL

public void activateRECALL()
Activates the usage of Recall as GMD evaluation measure.


deactivateRECALL

public void deactivateRECALL()
Deactivates the usage of Recall as GMD evaluation measure.


activatePRECISION

public void activatePRECISION()
Activates the usage of Precision as GMD evaluation measure.


deactivatePRECISION

public void deactivatePRECISION()
Deactivates the usage of Precision as GMD evaluation measure.


activateHybrid

public void activateHybrid()
Activates the usage of Hybrid GMD as evaluation measure.


deactivateHybrid

public void deactivateHybrid()
Deactivates the usage of Hybrid GMD as evaluation measure.


activateDefault

public void activateDefault()
Activates the usage of the Default GMD as evaluation measure.


deactivateDefault

public void deactivateDefault()
Deactivates the usage of the Default GMD as evaluation measure.


activateVI

public void activateVI()
Activates the usage of the Variation of Information as GMD evaluation measure.


deactivateVI

public void deactivateVI()
Deactivates the usage of the Variation of Information as GMD evaluation measure.


setConfig

public void setConfig(ExtendedStatisticComponent.Config config)
Sets the cost configuration to use for GMD processing.

Parameters:
config - ExtendedStatisticComponent.Config attribute.

getConfig

public ExtendedStatisticComponent.Config getConfig()
Returns the current cost configuration set for GMD processing.

Returns:
The current configuration

hasGMD

public boolean hasGMD()
Checks whether this StatisticComponent calculates the Generalized Merge Distance. Important for the output components.

Overrides:
hasGMD in class StatisticComponent
Returns:
true since this is the extended component with GMD computation

setExtractedData

public void setExtractedData()
Sets the size of the extracted data. Necessary for generating the unique clusters for the gold standard.


computeGMD

public double computeGMD(Collection<? extends Collection<DuDeObject>> erCluster,
                         Collection<? extends Collection<DuDeObject>> goldCluster)
Returns the Generalized Merge Distance based on the current cost configurations.

Parameters:
goldCluster - The Gold Standard in cluster format
erCluster - The ER result to be evaluated, in cluster format
Returns:
The GMD for the given cluster sets and configurations.
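
This page does not show how the distance is computed. As a rough sketch only, and not DuDe's implementation, a merge distance of this kind can be computed in a single pass over the gold clusters: each gold cluster is assembled from pieces split off the ER clusters, and every split and merge is charged through a cost function. The names GmdSketch, gmd, fm (merge cost), and fs (split cost) are assumptions for illustration, records are plain strings, and both clusterings are assumed to contain exactly the same records.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.DoubleBinaryOperator;

// Hypothetical sketch, not DuDe's implementation: records are plain strings.
final class GmdSketch {
    // Cost of transforming the clustering er into the clustering gold, given a
    // merge cost fm(x, y) and a split cost fs(x, y) defined on cluster sizes.
    // Assumes every record in gold also occurs in exactly one cluster of er.
    static double gmd(List<List<String>> er, List<List<String>> gold,
                      DoubleBinaryOperator fm, DoubleBinaryOperator fs) {
        Map<String, Integer> clusterOf = new HashMap<>(); // record -> index of its ER cluster
        int[] remaining = new int[er.size()];             // records not yet split off, per ER cluster
        for (int i = 0; i < er.size(); i++) {
            for (String r : er.get(i)) {
                clusterOf.put(r, i);
            }
            remaining[i] = er.get(i).size();
        }
        double cost = 0.0;
        for (List<String> target : gold) {
            // Count how many records of this gold cluster sit in each ER cluster.
            Map<Integer, Integer> parts = new HashMap<>();
            for (String r : target) {
                parts.merge(clusterOf.get(r), 1, Integer::sum);
            }
            int merged = 0; // size of the gold cluster assembled so far
            for (Map.Entry<Integer, Integer> e : parts.entrySet()) {
                int idx = e.getKey();
                int cnt = e.getValue();
                if (remaining[idx] > cnt) {
                    // Split the needed records off the rest of the ER cluster.
                    cost += fs.applyAsDouble(cnt, remaining[idx] - cnt);
                }
                remaining[idx] -= cnt;
                if (merged > 0) {
                    // Merge this piece into what has been assembled so far.
                    cost += fm.applyAsDouble(cnt, merged);
                }
                merged += cnt;
            }
        }
        return cost;
    }
}
```

In this sketch, constant costs of 1 for both functions count the number of split and merge operations, while a product cost x * y for merges with a zero split cost counts the gold-standard pairs missing from the ER result.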

findCluster

public int findCluster(HashMap<DuDeObject,Integer> m,
                       DuDeObject item)
Searches a hash map for the value of a given key by comparing ids (source id and object id).

Parameters:
m - The hashmap that assigns a cluster id to each record
item - The DuDeObject whose cluster id in R is searched for
Returns:
The cluster id the given record has in R.
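
A lookup that compares ids instead of relying on equals/hashCode might look like the following sketch. The Rec type with its sourceId and objectId fields is a hypothetical stand-in for DuDeObject, and returning -1 for a missing record is an assumption. This page does not say what the real method returns in that case.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not DuDe code.
final class FindClusterSketch {
    // Stand-in for DuDeObject: identified by a source id and an object id.
    // It deliberately does not override equals/hashCode.
    static final class Rec {
        final String sourceId;
        final String objectId;

        Rec(String sourceId, String objectId) {
            this.sourceId = sourceId;
            this.objectId = objectId;
        }
    }

    // Scans the map and returns the cluster id of the key whose ids match item.
    static int findCluster(Map<Rec, Integer> m, Rec item) {
        for (Map.Entry<Rec, Integer> e : m.entrySet()) {
            Rec key = e.getKey();
            if (key.sourceId.equals(item.sourceId) && key.objectId.equals(item.objectId)) {
                return e.getValue();
            }
        }
        return -1; // assumption: signal "not found" with -1
    }
}
```

The scan is linear in the size of the map. It finds a record even when the item passed in is a different instance than the key stored in the map.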

normalize

public double normalize(double val1,
                        double val2)
Returns the first given value normalized to 1, taking the second value as the maximum. Needed for the computation of precision and recall.

Parameters:
val1 - Value to normalize.
val2 - Maximal possible value val1 could reach.
Returns:
The value of val1 normalized to 1.
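
Read literally, this description amounts to dividing the value by its maximum. A minimal sketch under that assumption follows. The class name NormalizeSketch is hypothetical, and returning 0 for a zero maximum is an assumption, since this page does not document that case.

```java
// Hypothetical sketch, not DuDe code.
final class NormalizeSketch {
    // Scales value so that max maps to 1.0.
    static double normalize(double value, double max) {
        if (max == 0.0) {
            return 0.0; // assumption: avoid division by zero
        }
        return value / max;
    }
}
```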

computeF1

public double computeF1(double precision,
                        double recall)
Computation of the f1 measure by calculating the harmonic mean of the given precision and recall.

Parameters:
precision - Precision value (normalized to 1).
recall - Recall value (normalized to 1).
Returns:
The pairwise f1 measure value (between 0 and 1).
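
The harmonic mean of precision and recall is 2 * precision * recall / (precision + recall). A minimal sketch follows. The class name F1Sketch is hypothetical, and returning 0 when both inputs are 0 is an assumption, not documented behavior.

```java
// Hypothetical sketch, not DuDe code.
final class F1Sketch {
    // Harmonic mean of precision and recall, both expected in [0, 1].
    static double f1(double precision, double recall) {
        double sum = precision + recall;
        if (sum == 0.0) {
            return 0.0; // assumption: define F1 as 0 when both inputs are 0
        }
        return 2.0 * precision * recall / sum;
    }
}
```

For example, a precision of 0.5 and a recall of 1.0 give 1.0 / 1.5, or about 0.667.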

computeSingleClusters

public Vector<Vector<DuDeObject>> computeSingleClusters(Vector<Vector<DuDeObject>> clusteredRecords)
Returns a list of clusters with each record alone in its cluster. Built from the current gold standard.
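
Splitting a clustering into singletons can be sketched as follows. This is an illustration on generic lists rather than on Vector<Vector<DuDeObject>>, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not DuDe code.
final class SingleClustersSketch {
    // Puts every record of every input cluster into a cluster of its own.
    static <T> List<List<T>> singleClusters(List<List<T>> clustered) {
        List<List<T>> singles = new ArrayList<>();
        for (List<T> cluster : clustered) {
            for (T record : cluster) {
                List<T> single = new ArrayList<>();
                single.add(record);
                singles.add(single);
            }
        }
        return singles;
    }
}
```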


processGMD

public double processGMD()
Processes the Generalized Merge Distance regarding the current configuration if it is activated.

Returns:
The final costs needed to transform the ER result to the gold standard.

createResultCluster

public void createResultCluster()
Creates a Vector of Vector of DuDeObjects from the given data structure (Transform gold standard from duplicate pairs to cluster format).


generateClustersForUnique

public Vector<Vector<DuDeObject>> generateClustersForUnique(Vector<Vector<DuDeObject>> clusters)
Generates for each unique item (that is not listed in the pairwise er result) its own cluster and adds it to the given clusters.

Parameters:
clusters - The ER result in cluster format, without the unique items.
Returns:
The ER result in cluster format, with the unique items in their own cluster.
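
The idea can be sketched as follows, using plain strings as records. Unlike the documented method, this hypothetical version receives the full set of records as an explicit allRecords parameter. How the real class obtains the extracted data is not shown on this page.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not DuDe code.
final class UniqueClustersSketch {
    // Returns the given clusters plus one singleton cluster for every record
    // in allRecords that does not appear in any of the given clusters.
    static List<List<String>> withUniqueClusters(List<List<String>> clusters,
                                                 Collection<String> allRecords) {
        Set<String> seen = new HashSet<>();
        for (List<String> c : clusters) {
            seen.addAll(c);
        }
        List<List<String>> result = new ArrayList<>(clusters);
        for (String r : allRecords) {
            if (seen.add(r)) { // true only if r was not in any cluster yet
                List<String> single = new ArrayList<>();
                single.add(r);
                result.add(single);
            }
        }
        return result;
    }
}
```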

addDuplicate

public void addDuplicate(DuDeObjectPair pair,
                         boolean actualComparison)
Adds a DuDeObjectPair to the knowledge base that is labeled as a detected duplicate and the gold standard's duplicate pairs.

Overrides:
addDuplicate in class StatisticComponent
Parameters:
pair - A detected duplicate.
actualComparison - true, if the pair should be counted as comparison; otherwise false.


Copyright © 2011 Hasso Plattner Institute - Chair of Information Systems. All Rights Reserved.