java.lang.Object
  de.hpi.fgis.dude.postprocessor.StatisticComponent
    de.hpi.fgis.dude.postprocessor.ExtendedStatisticComponent
public class ExtendedStatisticComponent
extends StatisticComponent

ExtendedStatisticComponent provides functionality for gathering statistics on the measures that can be realized with the Generalized Merge Distance (GMD). Currently included are Precision, Recall, F1, Variation of Information, Basic Merge Distance, and a hybrid measure. To compute them, a collection of real duplicates has to be added.

The GMD clusters the duplicate pairs for the computation and thus generates the transitive closure of the duplicates. It therefore assumes that the set of detected duplicate pairs is complete. Otherwise, the GMD results for Precision/Recall/F1 will differ from those of the StatisticComponent, whose results are based on the actual number of duplicate pairs detected.
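The effect of the transitive closure can be shown with a small, self-contained sketch. It is not DuDe code: the class `ClosureDemo` and the use of plain strings as record ids are assumptions made for illustration. It clusters the detected pairs with a simple union-find structure and counts the pairs that the resulting clusters imply.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of DuDe: counts the pairs implied by the transitive closure. */
final class ClosureDemo {

    /** Follows parent links until the representative (root) of the record's cluster is reached. */
    static String find(Map<String, String> parent, String record) {
        parent.putIfAbsent(record, record);
        String root = record;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        return root;
    }

    /** Number of record pairs contained in the transitive closure of the given pairs. */
    static long closurePairCount(String[][] detectedPairs) {
        Map<String, String> parent = new HashMap<>();
        for (String[] pair : detectedPairs) {
            String rootA = find(parent, pair[0]);
            String rootB = find(parent, pair[1]);
            parent.put(rootA, rootB); // union the two clusters
        }
        Map<String, Long> clusterSizes = new HashMap<>();
        for (String record : new ArrayList<>(parent.keySet())) {
            clusterSizes.merge(find(parent, record), 1L, Long::sum);
        }
        long pairs = 0;
        for (long size : clusterSizes.values()) {
            pairs += size * (size - 1) / 2; // a cluster of n records implies n(n-1)/2 pairs
        }
        return pairs;
    }
}
```

For the detected pairs (a, b) and (b, c), a pair-based count sees 2 duplicates, while the closure contains 3, because (a, c) is implied. Pair-based and cluster-based Precision/Recall/F1 only agree when the detected pair set is already transitively closed.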
Nested Class Summary | |
---|---|
static class | ExtendedStatisticComponent.Config |
Field Summary | |
---|---|
Collection<DuDeObjectPair> | duplicates - Collection of all duplicate pairs of the comparisons. |
Vector<Vector<DuDeObject>> | resultCluster - Final ER result in cluster format. |
Fields inherited from class de.hpi.fgis.dude.postprocessor.StatisticComponent |
---|
actualComparisonCount, algorithm, checkMemory, endDate, endTime, falseNegativesByComparison, falsePositives, falsePositivesByComparison, goldStandard, memoryCheckFrequency, NO_STATISTIC_VALUE, pairCount, startDate, startTime, task, timer, trueNegativesByComparison, truePositives, truePositivesByComparison |
Constructor Summary | |
---|---|
protected | ExtendedStatisticComponent() - Internal constructor for Jsonable deserialization. |
 | ExtendedStatisticComponent(Algorithm algorithm) - Initializes an ExtendedStatisticComponent with no gold standard and default configuration for GMD. |
 | ExtendedStatisticComponent(GoldStandard goldStandard, Algorithm algorithm, ExtendedStatisticComponent.Config config) - Initializes an ExtendedStatisticComponent using the passed DuDeObjectPairs as real duplicates. |
Method Summary | |
---|---|
void | activateDefault() - Activates the usage of the Default GMD as evaluation measure. |
void | activateF1() - Activates the usage of PairwiseF1 (and by that Precision and Recall) as GMD evaluation measure. |
void | activateHybrid() - Activates the usage of Hybrid GMD as evaluation measure. |
void | activatePRECISION() - Activates the usage of Precision as GMD evaluation measure. |
void | activateRECALL() - Activates the usage of Recall as GMD evaluation measure. |
void | activateVI() - Activates the usage of the Variation of Information as GMD evaluation measure. |
void | addDuplicate(DuDeObjectPair pair, boolean actualComparison) - Adds a DuDeObjectPair to the knowledge base that is labeled as a detected duplicate and the gold standard's duplicate pairs. |
double | computeF1(double precision, double recall) - Computes the F1 measure as the harmonic mean of the given precision and recall. |
double | computeGMD(Collection<? extends Collection<DuDeObject>> erCluster, Collection<? extends Collection<DuDeObject>> goldCluster) - Returns the Generalized Merge Distance based on the current cost configurations. |
Vector<Vector<DuDeObject>> | computeSingleClusters(Vector<Vector<DuDeObject>> clusteredRecords) - Returns a list of clusters with each record alone in its cluster. |
void | createResultCluster() - Creates a Vector of Vector of DuDeObjects from the given data structure (transforms the gold standard from duplicate pairs to cluster format). |
void | deactivateDefault() - Deactivates the usage of the Default GMD as evaluation measure. |
void | deactivateF1() - Deactivates the usage of PairwiseF1 as GMD evaluation measure. |
void | deactivateHybrid() - Deactivates the usage of Hybrid GMD as evaluation measure. |
void | deactivatePRECISION() - Deactivates the usage of Precision as GMD evaluation measure. |
void | deactivateRECALL() - Deactivates the usage of Recall as GMD evaluation measure. |
void | deactivateVI() - Deactivates the usage of the Variation of Information as GMD evaluation measure. |
int | findCluster(HashMap<DuDeObject,Integer> m, DuDeObject item) - Searches for the value of a given key in a hash map by comparing their ids (from source and object). |
Vector<Vector<DuDeObject>> | generateClustersForUnique(Vector<Vector<DuDeObject>> clusters) - Generates a separate cluster for each unique item (one that is not listed in the pairwise ER result) and adds it to the given clusters. |
ExtendedStatisticComponent.Config | getConfig() - Returns the current cost configuration set for GMD processing. |
boolean | hasGMD() - Checks whether this StatisticComponent calculates the Generalized Merge Distance. |
double | normalize(double val1, double val2) - Returns the first given value normalized to 1, where the second value is the maximum. |
double | processGMD() - Processes the Generalized Merge Distance regarding the current configuration if it is activated. |
void | setConfig(ExtendedStatisticComponent.Config config) - Sets the cost configuration to use for GMD processing. |
void | setExtractedData() - Sets the size of the extracted data. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public Vector<Vector<DuDeObject>> resultCluster
Final ER result in cluster format.

public final transient Collection<DuDeObjectPair> duplicates
Collection of all duplicate pairs of the comparisons.
Constructor Detail |
---|
public ExtendedStatisticComponent(Algorithm algorithm)
Initializes an ExtendedStatisticComponent with no gold standard and default configuration for GMD.
Parameters:
algorithm - The used algorithm.

public ExtendedStatisticComponent(GoldStandard goldStandard, Algorithm algorithm, ExtendedStatisticComponent.Config config)
Initializes an ExtendedStatisticComponent using the passed DuDeObjectPairs as real duplicates.
Parameters:
goldStandard - The gold standard which these statistics are based on.
algorithm - Used algorithm.
config - Used configuration of GMD.

protected ExtendedStatisticComponent()
Internal constructor for Jsonable deserialization.
Method Detail |
---|
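public void activateF1()
Activates the usage of PairwiseF1 (and by that Precision and Recall) as GMD evaluation measure.

public void deactivateF1()
Deactivates the usage of PairwiseF1 as GMD evaluation measure.

public void activateRECALL()
Activates the usage of Recall as GMD evaluation measure.

public void deactivateRECALL()
Deactivates the usage of Recall as GMD evaluation measure.

public void activatePRECISION()
Activates the usage of Precision as GMD evaluation measure.

public void deactivatePRECISION()
Deactivates the usage of Precision as GMD evaluation measure.

public void activateHybrid()
Activates the usage of Hybrid GMD as evaluation measure.

public void deactivateHybrid()
Deactivates the usage of Hybrid GMD as evaluation measure.

public void activateDefault()
Activates the usage of the Default GMD as evaluation measure.

public void deactivateDefault()
Deactivates the usage of the Default GMD as evaluation measure.

public void activateVI()
Activates the usage of the Variation of Information as GMD evaluation measure.

public void deactivateVI()
Deactivates the usage of the Variation of Information as GMD evaluation measure.

public void setConfig(ExtendedStatisticComponent.Config config)
Sets the cost configuration to use for GMD processing.
Parameters:
config - ExtendedStatisticComponent.Config attribute.

public ExtendedStatisticComponent.Config getConfig()
Returns the current cost configuration set for GMD processing.

public boolean hasGMD()
Checks whether this StatisticComponent calculates the Generalized Merge Distance.
Overrides:
hasGMD in class StatisticComponent

public void setExtractedData()
Sets the size of the extracted data.

public double computeGMD(Collection<? extends Collection<DuDeObject>> erCluster, Collection<? extends Collection<DuDeObject>> goldCluster)
Returns the Generalized Merge Distance based on the current cost configurations.
Parameters:
goldCluster - The gold standard in cluster format.
erCluster - The questionable ER result in cluster format.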
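As a rough illustration of what such a computation can look like, the following sketch computes a merge distance by first splitting the ER clusters into the pieces needed by the gold standard and then merging those pieces. It is not the DuDe implementation: the class `GmdSketch`, the generic record type, and passing the two cost functions as `DoubleBinaryOperator` arguments (instead of a `Config` object) are assumptions. It also assumes that both clusterings cover the same records and that no record appears twice.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.DoubleBinaryOperator;

/** Hypothetical sketch of a split-then-merge distance between two clusterings. */
final class GmdSketch {

    static <T> double gmd(Collection<? extends Collection<T>> erCluster,
                          Collection<? extends Collection<T>> goldCluster,
                          DoubleBinaryOperator mergeCost,
                          DoubleBinaryOperator splitCost) {
        // Map every record to the id of its ER cluster and remember the cluster sizes.
        Map<T, Integer> clusterIdOf = new HashMap<>();
        List<Integer> remaining = new ArrayList<>();
        for (Collection<T> cluster : erCluster) {
            int id = remaining.size();
            for (T record : cluster) {
                clusterIdOf.put(record, id);
            }
            remaining.add(cluster.size());
        }

        double cost = 0.0;
        for (Collection<T> gold : goldCluster) {
            // Count how many records of this gold cluster come from each ER cluster.
            Map<Integer, Integer> parts = new TreeMap<>();
            for (T record : gold) {
                Integer id = clusterIdOf.get(record);
                if (id == null) {
                    throw new IllegalArgumentException("Record missing from ER result: " + record);
                }
                parts.merge(id, 1, Integer::sum);
            }
            int assembled = 0; // records of the gold cluster merged so far
            for (Map.Entry<Integer, Integer> part : parts.entrySet()) {
                int id = part.getKey();
                int count = part.getValue();
                int left = remaining.get(id);
                if (left > count) {
                    // The ER cluster holds records of other gold clusters: split the piece off.
                    cost += splitCost.applyAsDouble(count, left - count);
                }
                remaining.set(id, left - count);
                if (assembled != 0) {
                    // Merge the piece into what has been assembled so far.
                    cost += mergeCost.applyAsDouble(count, assembled);
                }
                assembled += count;
            }
        }
        return cost;
    }
}
```

With both cost functions returning 1, the result is simply the number of split and merge operations needed to turn the ER result into the gold standard. Other cost functions, for example the product of the two cluster sizes, weight each operation by the number of record pairs it affects.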
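public int findCluster(HashMap<DuDeObject,Integer> m, DuDeObject item)
Searches for the value of a given key in a hash map by comparing their ids (from source and object).
Parameters:
m - The hash map that assigns a cluster id to each record.
item - The DuDeObject whose cluster id in R is searched for.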
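A lookup that compares ids rather than relying on `equals`/`hashCode` could look like the sketch below. The record type `Rec` with its `sourceId` and `objectId` fields is a hypothetical stand-in for `DuDeObject`, and returning -1 for a missing record is an assumption; the documentation does not say what the real method returns in that case.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of an id-based cluster lookup. */
final class FindClusterSketch {

    /** Stand-in for DuDeObject: identified by a source id and an object id. */
    static final class Rec {
        final String sourceId;
        final String objectId;

        Rec(String sourceId, String objectId) {
            this.sourceId = sourceId;
            this.objectId = objectId;
        }
    }

    /** Scans the map and returns the cluster id of the entry whose key has the same ids as item. */
    static int findCluster(Map<Rec, Integer> m, Rec item) {
        for (Map.Entry<Rec, Integer> entry : m.entrySet()) {
            Rec key = entry.getKey();
            if (key.sourceId.equals(item.sourceId) && key.objectId.equals(item.objectId)) {
                return entry.getValue();
            }
        }
        return -1; // assumption: signals "not found"
    }
}
```

The linear scan finds a record even when the key and the queried object are different instances, at the price of O(n) time per lookup.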
public double normalize(double val1, double val2)
Returns the first given value normalized to 1, where the second value is the maximum.
Parameters:
val1 - Value to normalize.
val2 - Maximal possible value val1 could reach.
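Read literally, this normalization divides the value by its maximum. A minimal sketch, assuming that a maximum of 0 should yield 0 (the documentation does not specify this case) and using the hypothetical class name `NormalizeSketch`:

```java
/** Hypothetical sketch of normalizing a value against its maximum. */
final class NormalizeSketch {

    static double normalize(double val1, double val2) {
        if (val2 == 0.0) {
            return 0.0; // assumption: avoid division by zero
        }
        return val1 / val2;
    }
}
```

For example, `normalize(3.0, 4.0)` yields 0.75.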
public double computeF1(double precision, double recall)
Computes the F1 measure as the harmonic mean of the given precision and recall.
Parameters:
precision - Precision value (normalized to 1).
recall - Recall value (normalized to 1).
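The harmonic mean of precision P and recall R is 2PR / (P + R). A minimal sketch, assuming that F1 is defined as 0 when both inputs are 0 (an assumption, since the documentation does not cover this case); the class name `F1Sketch` is hypothetical:

```java
/** Hypothetical sketch of the F1 computation. */
final class F1Sketch {

    static double computeF1(double precision, double recall) {
        double sum = precision + recall;
        if (sum == 0.0) {
            return 0.0; // assumption: avoid division by zero
        }
        return 2.0 * precision * recall / sum;
    }
}
```

For example, precision 1.0 and recall 0.5 give an F1 of 2/3, which is lower than their arithmetic mean of 0.75 because the harmonic mean penalizes the weaker of the two values.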
public Vector<Vector<DuDeObject>> computeSingleClusters(Vector<Vector<DuDeObject>> clusteredRecords)
Returns a list of clusters with each record alone in its cluster.
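The transformation behind computeSingleClusters can be sketched as follows. This is an illustration with a generic record type in place of `DuDeObject`, not the DuDe code; the class name `SingleClustersSketch` is hypothetical.

```java
import java.util.Vector;

/** Hypothetical sketch: puts every record of a clustering into a cluster of its own. */
final class SingleClustersSketch {

    static <T> Vector<Vector<T>> computeSingleClusters(Vector<Vector<T>> clusteredRecords) {
        Vector<Vector<T>> singles = new Vector<>();
        for (Vector<T> cluster : clusteredRecords) {
            for (T record : cluster) {
                Vector<T> single = new Vector<>();
                single.add(record);
                singles.add(single);
            }
        }
        return singles; // the input clustering is left unchanged
    }
}
```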
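public double processGMD()
Processes the Generalized Merge Distance regarding the current configuration if it is activated.

public void createResultCluster()
Creates a Vector of Vector of DuDeObjects from the given data structure (transforms the gold standard from duplicate pairs to cluster format).

public Vector<Vector<DuDeObject>> generateClustersForUnique(Vector<Vector<DuDeObject>> clusters)
Generates a separate cluster for each unique item (one that is not listed in the pairwise ER result) and adds it to the given clusters.
Parameters:
clusters - The ER result in cluster format, without the unique items.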
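Completing a clustering with singleton clusters might be sketched as below. Unlike the documented method, this illustration takes the full record collection as an explicit second parameter `allRecords`; that parameter, the generic record type, and the class name `UniqueClustersSketch` are assumptions, since the page does not say where the real method obtains the unique items.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.Vector;

/** Hypothetical sketch: adds a singleton cluster for every record that is not clustered yet. */
final class UniqueClustersSketch {

    static <T> Vector<Vector<T>> generateClustersForUnique(Vector<Vector<T>> clusters,
                                                           Collection<T> allRecords) {
        Set<T> clustered = new HashSet<>();
        for (Vector<T> cluster : clusters) {
            clustered.addAll(cluster);
        }
        for (T record : allRecords) {
            if (clustered.add(record)) { // true only for records not seen in any cluster
                Vector<T> single = new Vector<>();
                single.add(record);
                clusters.add(single);
            }
        }
        return clusters; // the given clusters, extended in place
    }
}
```

After this step every record belongs to exactly one cluster, which a cluster-based comparison against the gold standard needs.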
public void addDuplicate(DuDeObjectPair pair, boolean actualComparison)
Adds a DuDeObjectPair to the knowledge base that is labeled as a detected duplicate and the gold standard's duplicate pairs.
Overrides:
addDuplicate in class StatisticComponent
Parameters:
pair - A detected duplicate.
actualComparison - true, if the pair should be counted as comparison; otherwise false.