de.hpi.fgis.dude.postprocessor
Class StatisticComponent

java.lang.Object
  extended by de.hpi.fgis.dude.postprocessor.StatisticComponent
All Implemented Interfaces:
AutoJsonable
Direct Known Subclasses:
ExtendedStatisticComponent

public class StatisticComponent
extends Object
implements AutoJsonable

StatisticComponent provides functionality for gathering statistics concerning the recall, precision and f-measure. To compute them, a collection of real duplicates (a gold standard) has to be provided.

Author:
Matthias Pohl

Field Summary
protected  long actualComparisonCount
           
protected  Algorithm algorithm
           
protected  boolean checkMemory
           
protected  Date endDate
           
protected  long endTime
           
protected  long falseNegativesByComparison
           
protected  long falsePositives
           
protected  long falsePositivesByComparison
           
protected  GoldStandard goldStandard
           
protected  long memoryCheckFrequency
           
protected static int NO_STATISTIC_VALUE
           
protected  long pairCount
           
protected  Date startDate
           
protected  long startTime
           
protected  MemoryCheckerTask task
           
protected  Timer timer
           
protected  long trueNegativesByComparison
           
protected  long truePositives
           
protected  long truePositivesByComparison
           
 
Constructor Summary
protected StatisticComponent()
          Internal constructor for Jsonable deserialization.
  StatisticComponent(Algorithm algorithm)
          Initializes a StatisticComponent with no gold standard.
  StatisticComponent(GoldStandard goldStandard, Algorithm algorithm)
          Initializes a StatisticComponent using the passed gold standard as the set of real duplicates.
 
Method Summary
 void addDuplicate(DuDeObjectPair pair)
          Adds a DuDeObjectPair to the knowledge base that is labeled as a detected duplicate.
 void addDuplicate(DuDeObjectPair pair, boolean actualComparison)
          Adds a DuDeObjectPair to the knowledge base that is labeled as a detected duplicate.
 void addDuplicate(Iterable<DuDeObjectPair> pairs)
          Adds several DuDeObjectPairs to the knowledge base that are labeled as detected duplicates.
 void addNonDuplicate(DuDeObjectPair pair)
          Adds a DuDeObjectPair to the knowledge base that is labeled as a detected non-duplicate.
 void addNonDuplicate(DuDeObjectPair pair, boolean actualComparison)
          Adds a DuDeObjectPair to the knowledge base that is labeled as a detected non-duplicate.
 void addNonDuplicate(Iterable<DuDeObjectPair> pairs)
          Adds several DuDeObjectPairs to the knowledge base that are labeled as detected non-duplicates.
 void addPair(DuDeObjectPair pair, boolean positive)
          Adds a DuDeObjectPair to the knowledge base.
 void addPair(Iterable<DuDeObjectPair> pairs, boolean positive)
          Adds several DuDeObjectPairs to the knowledge base.
protected  void checkMemoryUsage()
          Starts the memory usage task.
 String getAverageMemoryUsed()
          Gets the registered average amount of memory used during the experiment.
 long getComparisonCount()
          Returns the number of pairs that were already compared.
 Date getEndDate()
          Gets the date of the specified end time of an algorithm.
 long getFalseNegatives()
          Returns the false negatives count.
 long getFalseNegativesByComparison()
          Returns the false negatives count that are explicitly classified by the comparator.
 long getFalsePositives()
          Returns the false positives count.
 long getFalsePositivesByComparison()
          Returns the false positives count that are explicitly classified by the comparator.
 double getFMeasure()
          Returns the f-measure based on the current knowledge base.
 double getFMeasureByComparison()
          Returns the f-measure based on the current knowledge base and the actual comparisons.
 String getMaximumMemoryUsed()
          Gets the registered maximum amount of memory during the experiment.
 long getMemoryCheckFrequency()
          Gets the frequency of memory checks.
 String getMinimumMemoryUsed()
          Gets the registered minimum amount of memory during the experiment.
 long getNumberOfCandidateComparisons()
          Returns the maximum number of pairs that would be generated by the naive approach.
 long getNumberOfRealDuplicates()
          Returns the size of the gold standard.
 long getObjectCount()
          Returns the number of records that were processed by the algorithm.
 long getPairCount()
          Returns the number of pairs that were already considered.
 double getPrecision()
          Returns the precision based on the current knowledge base.
 double getPrecisionByComparison()
          Returns the precision based on the current knowledge base and the actual comparisons.
 double getRecall()
          Returns the recall based on the current knowledge base.
 double getRecallByComparison()
          Returns the recall based on the current knowledge base and the actual comparisons.
 double getReductionRatio()
          Returns the reduction ratio based on the current knowledge base.
 double getReductionRatioByComparison()
          Returns the reduction ratio based on the current knowledge base and the actual comparisons.
 long getRuntime()
          Gets the time difference between beginning time and finishing time.
 Date getStartDate()
          Gets the date of the specified start time of an algorithm.
 long getTrueNegatives()
          Returns the true negatives count.
 long getTrueNegativesByComparison()
          Returns the true negatives count that are explicitly classified by the comparator.
 long getTruePositives()
          Returns the true positives count.
 long getTruePositivesByComparison()
          Returns the true positives count that are explicitly classified by the comparator.
 boolean goldStandardSet()
          Checks whether a gold standard was passed.
 boolean hasGMD()
          Checks whether this StatisticComponent calculates the Generalized Merge Distance.
 boolean isCheckMemory()
          Gets the boolean flag that indicates the activation status of memory checking.
 boolean isDuplicate(DuDeObjectPair pair)
          Returns true if the DuDeObjectPair exists in the set of real duplicate pairs.
 boolean isNonDuplicate(DuDeObjectPair pair)
          Checks whether a specific pair does not exist in the set of real duplicate pairs.
 void reset()
          Sets the attributes for TruePositives, FalsePositives, TruePositivesByComparison, FalsePositivesByComparison, TrueNegativesByComparison, FalseNegativesByComparison, PairCount and ComparisonCount to 0.
 void setBeginningTime()
          Deprecated. Replaced by setStartTime()
 void setCheckMemory(boolean checkMemory)
          Sets the boolean flag that indicates the activation status of memory checking.
 void setEndTime()
          Sets the current time as the finishing time for the runtime.
 void setFinishingTime()
          Deprecated. Replaced by setEndTime()
 void setMemoryCheckFrequency(long memoryCheckFrequency)
          Sets the frequency of memory checks.
 void setStartTime()
          Sets the current time as starting time for the runtime and initiates memory monitoring.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

NO_STATISTIC_VALUE

protected static final int NO_STATISTIC_VALUE
See Also:
Constant Field Values

goldStandard

protected GoldStandard goldStandard

truePositives

protected long truePositives

falsePositives

protected long falsePositives

truePositivesByComparison

protected long truePositivesByComparison

falsePositivesByComparison

protected long falsePositivesByComparison

trueNegativesByComparison

protected long trueNegativesByComparison

falseNegativesByComparison

protected long falseNegativesByComparison

pairCount

protected long pairCount

actualComparisonCount

protected long actualComparisonCount

startTime

protected long startTime

startDate

protected Date startDate

endTime

protected long endTime

endDate

protected Date endDate

algorithm

protected Algorithm algorithm

checkMemory

protected boolean checkMemory

task

protected transient MemoryCheckerTask task

timer

protected transient Timer timer

memoryCheckFrequency

protected long memoryCheckFrequency

Constructor Detail

StatisticComponent

public StatisticComponent(Algorithm algorithm)
Initializes a StatisticComponent with no gold standard.

Parameters:
algorithm - The used algorithm.

StatisticComponent

public StatisticComponent(GoldStandard goldStandard,
                          Algorithm algorithm)
Initializes a StatisticComponent using the passed gold standard as the set of real duplicates.

Parameters:
goldStandard - The gold standard which these statistics are based on.
algorithm - Used algorithm.

StatisticComponent

protected StatisticComponent()
Internal constructor for Jsonable deserialization.

Method Detail

hasGMD

public boolean hasGMD()
Checks whether this StatisticComponent calculates the Generalized Merge Distance. Important for the output components.

Returns:
false since this is the original component without GMD computation

goldStandardSet

public boolean goldStandardSet()
Checks whether a gold standard was passed. If no gold standard was set, some of the statistics cannot be calculated.

Returns:
true, if a gold standard was set; otherwise false.

getTruePositives

public long getTruePositives()
Returns the true positives count.

Returns:
The true positives count.

getFalsePositives

public long getFalsePositives()
Returns the false positives count.

Returns:
The false positives count.

getTrueNegatives

public long getTrueNegatives()
Returns the true negatives count.

Returns:
The true negatives count.

getFalseNegatives

public long getFalseNegatives()
Returns the false negatives count.

Returns:
The false negatives count.

getTruePositivesByComparison

public long getTruePositivesByComparison()
Returns the true positives count that are explicitly classified by the comparator.

Returns:
The true positives count that are explicitly classified by the comparator.

getFalsePositivesByComparison

public long getFalsePositivesByComparison()
Returns the false positives count that are explicitly classified by the comparator.

Returns:
The false positives count that are explicitly classified by the comparator.

getTrueNegativesByComparison

public long getTrueNegativesByComparison()
Returns the true negatives count that are explicitly classified by the comparator.

Returns:
The true negatives count that are explicitly classified by the comparator.

getFalseNegativesByComparison

public long getFalseNegativesByComparison()
Returns the false negatives count that are explicitly classified by the comparator.

Returns:
The false negatives count that are explicitly classified by the comparator.

getPairCount

public long getPairCount()
Returns the number of pairs that were already considered. These are all pairs, including those that should not be counted as comparison (e.g. those created by building the transitive closure).

Returns:
The number of pairs that were already considered.

getComparisonCount

public long getComparisonCount()
Returns the number of pairs that were already compared. These are only those pairs that should be counted as comparison (e.g. NOT those created by building the transitive closure).

Returns:
The number of pairs that were already compared.
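
The difference between the pair count and the comparison count can be illustrated with a small stand-alone counter. This is only a sketch of the documented semantics, not DuDe's implementation; the class `PairCounterSketch` and its `addPair(boolean)` method are hypothetical names introduced for illustration.

```java
/** Hypothetical illustration, not part of DuDe: every pair is counted, but only actual comparisons raise the comparison count. */
final class PairCounterSketch {
    private long pairCount;
    private long comparisonCount;

    /** Registers one pair; actualComparison is false for derived pairs, e.g. from a transitive closure. */
    void addPair(boolean actualComparison) {
        pairCount++;
        if (actualComparison) {
            comparisonCount++;
        }
    }

    long getPairCount() {
        return pairCount;
    }

    long getComparisonCount() {
        return comparisonCount;
    }

    public static void main(String[] args) {
        PairCounterSketch counter = new PairCounterSketch();
        counter.addPair(true);   // pair compared by the comparator
        counter.addPair(true);   // pair compared by the comparator
        counter.addPair(false);  // pair derived without a comparison
        System.out.println(counter.getPairCount() + " pairs, "
                + counter.getComparisonCount() + " comparisons");
    }
}
```

Under this reading, the comparison count can never exceed the pair count.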

addPair

public void addPair(DuDeObjectPair pair,
                    boolean positive)
Adds a DuDeObjectPair to the knowledge base. The pair is counted as comparison.

Parameters:
pair - The pair that shall be considered in the statistics.
positive - true, if the passed pair was detected as a duplicate; otherwise false.

addPair

public void addPair(Iterable<DuDeObjectPair> pairs,
                    boolean positive)
Adds several DuDeObjectPairs to the knowledge base. The pairs are counted as comparisons.

Parameters:
pairs - The pairs that shall be considered in the statistics.
positive - true, if the passed pairs were detected as duplicates; otherwise false.

addDuplicate

public void addDuplicate(DuDeObjectPair pair)
Adds a DuDeObjectPair to the knowledge base that is labeled as a detected duplicate. The pair is counted as comparison.

Parameters:
pair - A detected duplicate.

addDuplicate

public void addDuplicate(DuDeObjectPair pair,
                         boolean actualComparison)
Adds a DuDeObjectPair to the knowledge base that is labeled as a detected duplicate.

Parameters:
pair - A detected duplicate.
actualComparison - true, if the pair should be counted as comparison; otherwise false.

addDuplicate

public void addDuplicate(Iterable<DuDeObjectPair> pairs)
Adds several DuDeObjectPairs to the knowledge base that are labeled as detected duplicates. The pairs are counted as comparisons.

Parameters:
pairs - The pairs that shall be considered as detected duplicates in the statistics.

addNonDuplicate

public void addNonDuplicate(DuDeObjectPair pair)
Adds a DuDeObjectPair to the knowledge base that is labeled as a detected non-duplicate. The pair is counted as comparison.

Parameters:
pair - The pair that shall be considered as a detected non-duplicate in the statistics.

addNonDuplicate

public void addNonDuplicate(DuDeObjectPair pair,
                            boolean actualComparison)
Adds a DuDeObjectPair to the knowledge base that is labeled as a detected non-duplicate.

Parameters:
pair - The pair that shall be considered as a detected non-duplicate in the statistics.
actualComparison - true, if the pair is an actual comparison; otherwise false.

addNonDuplicate

public void addNonDuplicate(Iterable<DuDeObjectPair> pairs)
Adds several DuDeObjectPairs to the knowledge base that are labeled as detected non-duplicates. The pairs are counted as comparisons.

Parameters:
pairs - The pairs that shall be considered as detected non-duplicates in the statistics.

setBeginningTime

@Deprecated
public void setBeginningTime()
Deprecated. Replaced by setStartTime()

Sets the current time as starting time for the runtime and initiates memory monitoring.


setStartTime

public void setStartTime()
Sets the current time as starting time for the runtime and initiates memory monitoring.


getStartDate

public Date getStartDate()
Gets the date of the specified start time of an algorithm.

Returns:
Date object

setFinishingTime

@Deprecated
public void setFinishingTime()
Deprecated. Replaced by setEndTime()

Sets the current time as the finishing time for the runtime. Memory monitoring ends if it is activated.


setEndTime

public void setEndTime()
Sets the current time as the finishing time for the runtime. Memory monitoring ends if it is activated.


getEndDate

public Date getEndDate()
Gets the date of the specified end time of an algorithm.

Returns:
Date object

getRuntime

public long getRuntime()
Gets the time difference between beginning time and finishing time.

Returns:
Runtime in milliseconds
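
As a rough sketch of the start/end/runtime relationship described above: the runtime is the end timestamp minus the start timestamp, both in milliseconds. The class `RuntimeSketch` is a hypothetical stand-in, and the use of `System.currentTimeMillis()` as the clock is an assumption rather than something this page states.

```java
/** Hypothetical illustration, not part of DuDe: runtime as the difference of two millisecond timestamps. */
final class RuntimeSketch {

    /** Returns the elapsed time in milliseconds between the two timestamps. */
    static long runtimeMillis(long startTime, long endTime) {
        return endTime - startTime;
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis(); // assumed clock source
        Thread.sleep(20);                        // stands in for the algorithm run
        long end = System.currentTimeMillis();
        System.out.println("Runtime in ms: " + runtimeMillis(start, end));
    }
}
```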

getPrecision

public double getPrecision()
Returns the precision based on the current knowledge base.

Returns:
The precision value.

getRecall

public double getRecall()
Returns the recall based on the current knowledge base. In the context of blocking algorithms the returned value can be considered as pair completeness ratio.

Returns:
The recall value.

getFMeasure

public double getFMeasure()
Returns the f-measure based on the current knowledge base.

Returns:
The f-measure.
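
For orientation, the three measures can be written out using their standard definitions in terms of true positives, false positives and false negatives. The sketch below is stand-alone and hypothetical (`QualityMeasuresSketch` is not a DuDe class); in particular, returning 0.0 for an empty denominator is an assumption made here, and the library may handle that case differently.

```java
/** Hypothetical helper, not part of DuDe: textbook definitions of precision, recall and f-measure. */
final class QualityMeasuresSketch {

    /** Precision = TP / (TP + FP); 0.0 if nothing was detected (assumption of this sketch). */
    static double precision(long truePositives, long falsePositives) {
        long detected = truePositives + falsePositives;
        return detected == 0 ? 0.0 : (double) truePositives / detected;
    }

    /** Recall = TP / (TP + FN); 0.0 if there are no real duplicates (assumption of this sketch). */
    static double recall(long truePositives, long falseNegatives) {
        long real = truePositives + falseNegatives;
        return real == 0 ? 0.0 : (double) truePositives / real;
    }

    /** F-measure = harmonic mean of precision and recall. */
    static double fMeasure(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }

    public static void main(String[] args) {
        double p = precision(8, 2); // 8 of the 10 detected duplicates are real
        double r = recall(8, 8);    // 8 of the 16 real duplicates were found
        System.out.println("precision=" + p + " recall=" + r + " f-measure=" + fMeasure(p, r));
    }
}
```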

getReductionRatio

public double getReductionRatio()
Returns the reduction ratio based on the current knowledge base.

Returns:
The reduction ratio.
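
This page does not give the formula for the reduction ratio. A common definition in the blocking literature is one minus the fraction of candidate comparisons that were actually considered; the sketch below assumes that definition. `ReductionRatioSketch` and its parameter names are hypothetical.

```java
/** Hypothetical helper, not part of DuDe: reduction ratio under the common definition 1 - considered / candidates. */
final class ReductionRatioSketch {

    /**
     * @param consideredPairs     pairs produced by the algorithm
     * @param candidateComparisons pairs the naive approach would produce; must be positive
     */
    static double reductionRatio(long consideredPairs, long candidateComparisons) {
        if (candidateComparisons <= 0) {
            throw new IllegalArgumentException("candidateComparisons must be positive");
        }
        return 1.0 - (double) consideredPairs / candidateComparisons;
    }

    public static void main(String[] args) {
        // 250 of 1000 possible pairs were considered.
        System.out.println("Reduction ratio: " + reductionRatio(250, 1000));
    }
}
```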

getPrecisionByComparison

public double getPrecisionByComparison()
Returns the precision based on the current knowledge base and the actual comparisons.

Returns:
The precision value.

getRecallByComparison

public double getRecallByComparison()
Returns the recall based on the current knowledge base and the actual comparisons. In the context of blocking algorithms the returned value can be considered as pair completeness ratio.

Returns:
The recall value.

getFMeasureByComparison

public double getFMeasureByComparison()
Returns the f-measure based on the current knowledge base and the actual comparisons.

Returns:
The f-measure.

getReductionRatioByComparison

public double getReductionRatioByComparison()
Returns the reduction ratio based on the current knowledge base and the actual comparisons.

Returns:
The reduction ratio.

getNumberOfRealDuplicates

public long getNumberOfRealDuplicates()
Returns the size of the gold standard.

Returns:
The gold standard's size.

getObjectCount

public long getObjectCount()
Returns the number of records that were processed by the algorithm.

Returns:
The record count of the source data set.

getNumberOfCandidateComparisons

public long getNumberOfCandidateComparisons()
Returns the maximum number of pairs that would be generated by the naive approach.

Returns:
The number of pairs that would be generated by the naive approach.
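
As a point of reference, if the naive approach is taken to mean comparing every unordered pair of records from a single source (an assumption; this page does not define it), the number of candidate comparisons for n records is n(n-1)/2. `NaivePairCountSketch` is a hypothetical name.

```java
/** Hypothetical helper, not part of DuDe: number of unordered record pairs within a single source. */
final class NaivePairCountSketch {

    /** Returns n * (n - 1) / 2 for n records. */
    static long naivePairCount(long recordCount) {
        if (recordCount < 0) {
            throw new IllegalArgumentException("recordCount must not be negative");
        }
        return recordCount * (recordCount - 1) / 2;
    }

    public static void main(String[] args) {
        System.out.println("Pairs for 5 records: " + naivePairCount(5));
    }
}
```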

isDuplicate

public boolean isDuplicate(DuDeObjectPair pair)
Returns true if the DuDeObjectPair exists in the set of real duplicate pairs.

Parameters:
pair - The duplicate pair that is to be checked.
Returns:
true, if the duplicate pair exists in the set of real duplicate pairs.

isNonDuplicate

public boolean isNonDuplicate(DuDeObjectPair pair)
Checks whether a specific pair does not exist in the set of real duplicate pairs.

Parameters:
pair - The duplicate pair that is to be checked.
Returns:
false, if the duplicate pair exists in the set of real duplicate pairs.
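
The relationship between the two checks can be sketched with a plain set of real duplicate pairs. Everything in this block is hypothetical: `GoldStandardSketch` stands in for the gold standard, pairs are represented by two string identifiers, and `isNonDuplicate` is assumed to be the negation of `isDuplicate`, which is consistent with the return descriptions above but not confirmed by them.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration, not part of DuDe: membership checks against a set of real duplicate pairs. */
final class GoldStandardSketch {
    private final Set<String> realDuplicates = new HashSet<>();

    /** Builds an order-independent key; assumes identifiers do not contain '|'. */
    private static String key(String id1, String id2) {
        return id1.compareTo(id2) <= 0 ? id1 + "|" + id2 : id2 + "|" + id1;
    }

    void addRealDuplicate(String id1, String id2) {
        realDuplicates.add(key(id1, id2));
    }

    /** True if the pair is listed as a real duplicate. */
    boolean isDuplicate(String id1, String id2) {
        return realDuplicates.contains(key(id1, id2));
    }

    /** Assumed here to be the exact negation of isDuplicate. */
    boolean isNonDuplicate(String id1, String id2) {
        return !isDuplicate(id1, id2);
    }

    public static void main(String[] args) {
        GoldStandardSketch gold = new GoldStandardSketch();
        gold.addRealDuplicate("a", "b");
        System.out.println(gold.isDuplicate("b", "a") + " " + gold.isNonDuplicate("a", "c"));
    }
}
```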

getMemoryCheckFrequency

public long getMemoryCheckFrequency()
Gets the frequency of memory checks.

Returns:
Frequency of memory checks in ms.

setMemoryCheckFrequency

public void setMemoryCheckFrequency(long memoryCheckFrequency)
Sets the frequency of memory checks.

Parameters:
memoryCheckFrequency - Frequency of memory checks in ms. Default value is 5000 ms.

isCheckMemory

public boolean isCheckMemory()
Gets the boolean flag that indicates the activation status of memory checking.

Returns:
true, if memory checking is performed; otherwise false.

setCheckMemory

public void setCheckMemory(boolean checkMemory)
Sets the boolean flag that indicates the activation status of memory checking.

Parameters:
checkMemory - false, if memory checking should not be performed. Default value is true.

checkMemoryUsage

protected void checkMemoryUsage()
Starts the memory usage task.


getMaximumMemoryUsed

public String getMaximumMemoryUsed()
Gets the registered maximum amount of memory during the experiment.

Returns:
Maximum amount of memory in KB.

getMinimumMemoryUsed

public String getMinimumMemoryUsed()
Gets the registered minimum amount of memory during the experiment.

Returns:
Minimum amount of memory in KB.

getAverageMemoryUsed

public String getAverageMemoryUsed()
Gets the registered average amount of memory used during the experiment.

Returns:
Average amount of memory in KB.
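
How a periodic memory check could track minimum, maximum and average usage is sketched below with `java.util.Timer` and `java.util.TimerTask`. This is not the library's `MemoryCheckerTask`; the class `MemorySamplerSketch`, its methods, and the way used memory is measured (total minus free heap, in KB) are assumptions made for illustration.

```java
import java.util.Timer;
import java.util.TimerTask;

/** Hypothetical illustration, not DuDe's MemoryCheckerTask: samples used heap memory at a fixed period. */
final class MemorySamplerSketch extends TimerTask {
    private long minKb = Long.MAX_VALUE;
    private long maxKb = Long.MIN_VALUE;
    private long sumKb;
    private long samples;

    /** Called by the timer; measures used heap as total minus free memory (assumed metric). */
    @Override
    public void run() {
        Runtime rt = Runtime.getRuntime();
        record((rt.totalMemory() - rt.freeMemory()) / 1024);
    }

    /** Folds one sample (in KB) into the running minimum, maximum and sum. */
    synchronized void record(long usedKb) {
        minKb = Math.min(minKb, usedKb);
        maxKb = Math.max(maxKb, usedKb);
        sumKb += usedKb;
        samples++;
    }

    synchronized long getMinimumKb() { return samples == 0 ? 0 : minKb; }

    synchronized long getMaximumKb() { return samples == 0 ? 0 : maxKb; }

    synchronized long getAverageKb() { return samples == 0 ? 0 : sumKb / samples; }

    public static void main(String[] args) throws InterruptedException {
        MemorySamplerSketch task = new MemorySamplerSketch();
        Timer timer = new Timer(true);   // daemon thread, so it does not block JVM exit
        timer.schedule(task, 0, 50);     // first sample immediately, then every 50 ms
        Thread.sleep(200);               // stands in for the experiment
        timer.cancel();                  // stop sampling when the experiment ends
        System.out.println("min=" + task.getMinimumKb() + " KB, max=" + task.getMaximumKb()
                + " KB, avg=" + task.getAverageKb() + " KB");
    }
}
```

The sampling period plays the role of the memory check frequency: a shorter period gives finer-grained statistics at the cost of more overhead.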

reset

public void reset()
Sets the attributes for TruePositives, FalsePositives, TruePositivesByComparison, FalsePositivesByComparison, TrueNegativesByComparison, FalseNegativesByComparison, PairCount and ComparisonCount to 0. The algorithm and the gold standard are not changed. This allows the statistic component to be reused for several experiments. Please note that the start and end times for the next experiment have to be set again by using the corresponding methods.



Copyright © 2011 Hasso Plattner Institute - Chair of Information Systems. All Rights Reserved.