de.hpi.fgis.dude.similarityfunction.contentbased.impl
Class TFIDFSimilarityFunction

java.lang.Object
  extended by de.hpi.fgis.dude.similarityfunction.AbstractSimilarityFunction
      extended by de.hpi.fgis.dude.similarityfunction.contentbased.ContentBasedSimilarityFunction<TFIDFSimilarityFunction>
          extended by de.hpi.fgis.dude.similarityfunction.contentbased.impl.TFIDFSimilarityFunction
All Implemented Interfaces:
SimilarityFunction, StringSimilarity, AutoJsonable

public class TFIDFSimilarityFunction
extends ContentBasedSimilarityFunction<TFIDFSimilarityFunction>
implements StringSimilarity

TFIDFSimilarityFunction compares two DuDeObjects using the classic tf-idf metric. To enable tf-idf weighting, a DocumentFrequencyPreprocessor has to be set. Otherwise, this SimilarityFunction computes the cosine similarity of plain term-frequency vectors.
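
The fallback behaviour described above (cosine similarity over term-frequency vectors) can be illustrated with a minimal, self-contained sketch. It does not use DuDe and is not the library's implementation: the class TfCosineSketch, its method names, the literal (non-regex) treatment of the split token, and the choice to return 0.0 for null or empty input are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of DuDe.
final class TfCosineSketch {

    /** Counts how often each non-empty token occurs in the text. */
    static Map<String, Integer> termFrequencies(String text, String splitToken) {
        Map<String, Integer> tf = new HashMap<>();
        // Assumption: the split token is matched literally, not as a regex.
        for (String token : text.split(Pattern.quote(splitToken))) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Euclidean length of a term-frequency vector. */
    static double norm(Map<String, Integer> vector) {
        double sum = 0.0;
        for (int count : vector.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    /** Cosine similarity of the two term-frequency vectors, in [0.0, 1.0]. */
    static double similarity(String str1, String str2, String splitToken) {
        if (str1 == null || str2 == null) {
            return 0.0; // mirrors the documented null contract of getSimilarity
        }
        Map<String, Integer> a = termFrequencies(str1, splitToken);
        Map<String, Integer> b = termFrequencies(str2, splitToken);
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            Integer other = b.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        // Assumption: strings without any tokens are treated as dissimilar.
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }
}
```

Under these assumptions, two strings that share every token score close to 1.0, and strings with no token in common score 0.0.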

Author:
Ziawasch Abedjan, Matthias Pohl
See Also:
DocumentFrequencyPreprocessor

Nested Class Summary
 
Nested classes/interfaces inherited from interface de.hpi.fgis.dude.similarityfunction.SimilarityFunction
SimilarityFunction.SimilarityValidationState
 
Constructor Summary
protected TFIDFSimilarityFunction()
          Internal constructor for Jsonable deserialization.
  TFIDFSimilarityFunction(DocumentFrequencyPreprocessor idfPreprocessor, int attrIndex, String... defaultAttr)
          Initializes a TFIDFSimilarityFunction object for the passed attribute.
  TFIDFSimilarityFunction(DocumentFrequencyPreprocessor idfPreprocessor, String... defaultAttr)
          Initializes a TFIDFSimilarityFunction object for the passed attribute.
  TFIDFSimilarityFunction(int attrIndex, String... defaultAttr)
          Initializes a TFIDFSimilarityFunction object for the passed attribute.
  TFIDFSimilarityFunction(String... defaultAttr)
          Initializes a TFIDFSimilarityFunction object for the passed attribute.
 
Method Summary
protected  double compareAtomicValues(JsonAtomic value1, JsonAtomic value2)
          Calculates the similarity of the two passed JsonAtomics.
 double getSimilarity(String str1, String str2)
          Returns the similarity of the passed Strings, where 0.0 means that the Strings are completely different and 1.0 means that they are the same.
 String getSplitToken()
          Returns the split token.
 void setSplitToken(String splitTk)
          Sets the split token.
 String toString()
           
 TFIDFSimilarityFunction withSplitToken(String splitTk)
          Sets the split token and returns the current instance.
 
Methods inherited from class de.hpi.fgis.dude.similarityfunction.contentbased.ContentBasedSimilarityFunction
addAttribute, calculateSimilarity, calculateSimilarity, equals, getAttribute, hashCode, ignoreCapitalization, ignoringCapitalizationEnabled, setCompareArrayArrayStrategy, setCompareArrayAtomicStrategy, setCompareArrayRecordStrategy, setCompareRecordAtomicStrategy, setCompareRecordRecordStrategy
 
Methods inherited from class de.hpi.fgis.dude.similarityfunction.AbstractSimilarityFunction
getLastValidationState, getSimilarity, setValidationState
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

TFIDFSimilarityFunction

protected TFIDFSimilarityFunction()
Internal constructor for Jsonable deserialization.


TFIDFSimilarityFunction

public TFIDFSimilarityFunction(String... defaultAttr)
Initializes a TFIDFSimilarityFunction object for the passed attribute. Note that this constructor does not set any document frequencies; they must still be added before tf-idf weighting can be applied.

Parameters:
defaultAttr - The attribute for which the tf-based cosine similarity is calculated.

TFIDFSimilarityFunction

public TFIDFSimilarityFunction(int attrIndex,
                               String... defaultAttr)
Initializes a TFIDFSimilarityFunction object for the passed attribute. Note that this constructor does not set any document frequencies; they must still be added before tf-idf weighting can be applied.

Parameters:
attrIndex - The index of the default attribute. This parameter is used to select specific values of an array.
defaultAttr - The attribute for which the tf-based cosine similarity is calculated.

TFIDFSimilarityFunction

public TFIDFSimilarityFunction(DocumentFrequencyPreprocessor idfPreprocessor,
                               String... defaultAttr)
Initializes a TFIDFSimilarityFunction object for the passed attribute.

Parameters:
idfPreprocessor - The DocumentFrequencyPreprocessor that is needed for calculating the tf-idf similarity.
defaultAttr - The attribute for which the tf-idf-based cosine similarity is calculated.

TFIDFSimilarityFunction

public TFIDFSimilarityFunction(DocumentFrequencyPreprocessor idfPreprocessor,
                               int attrIndex,
                               String... defaultAttr)
Initializes a TFIDFSimilarityFunction object for the passed attribute.

Parameters:
idfPreprocessor - The DocumentFrequencyPreprocessor that is needed for calculating the tf-idf similarity.
attrIndex - The index of the default attribute. This parameter is used to select specific values of an array.
defaultAttr - The attribute for which the tf-idf-based cosine similarity is calculated.
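
What the idfPreprocessor contributes can be shown with a small, self-contained sketch of idf-weighted cosine similarity. The class TfIdfCosineSketch below is hypothetical and does not use DuDe or DocumentFrequencyPreprocessor. It assumes whitespace tokenization and the common idf variant log(N / df); this page does not confirm either choice for the library.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; not part of DuDe.
final class TfIdfCosineSketch {

    private final Map<String, Integer> documentFrequency = new HashMap<>();
    private int documentCount = 0;

    /** Registers one document: each distinct token raises its document frequency by one. */
    void addDocument(String text) {
        documentCount++;
        Set<String> distinct = new HashSet<>();
        for (String token : text.split(" ")) {
            if (!token.isEmpty()) {
                distinct.add(token);
            }
        }
        for (String token : distinct) {
            documentFrequency.merge(token, 1, Integer::sum);
        }
    }

    /** Assumed idf variant: log(N / df); tokens never seen in the corpus get weight 0. */
    double idf(String term) {
        int df = documentFrequency.getOrDefault(term, 0);
        return df == 0 ? 0.0 : Math.log((double) documentCount / df);
    }

    /** Builds the tf-idf vector of a string: term count multiplied by idf. */
    Map<String, Double> weights(String text) {
        Map<String, Double> weights = new HashMap<>();
        for (String token : text.split(" ")) {
            if (!token.isEmpty()) {
                weights.merge(token, 1.0, Double::sum);
            }
        }
        weights.replaceAll((term, tf) -> tf * idf(term));
        return weights;
    }

    /** Cosine similarity of the two tf-idf vectors; 0.0 for null or all-zero vectors. */
    double similarity(String str1, String str2) {
        if (str1 == null || str2 == null) {
            return 0.0;
        }
        Map<String, Double> a = weights(str1);
        Map<String, Double> b = weights(str2);
        double dot = 0.0;
        double sumA = 0.0;
        double sumB = 0.0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            sumA += entry.getValue() * entry.getValue();
            dot += entry.getValue() * b.getOrDefault(entry.getKey(), 0.0);
        }
        for (double weight : b.values()) {
            sumB += weight * weight;
        }
        if (sumA == 0.0 || sumB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(sumA * sumB);
    }
}
```

In this sketch a token that occurs in every registered document gets an idf of 0 and stops contributing to the score, which is the practical difference from comparing plain term-frequency vectors.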

Method Detail

getSplitToken

public String getSplitToken()
Returns the split token.

Returns:
The split token.

setSplitToken

public void setSplitToken(String splitTk)
Sets the split token.

Parameters:
splitTk - The token that is used for splitting the String.

withSplitToken

public TFIDFSimilarityFunction withSplitToken(String splitTk)
Sets the split token and returns the current instance.

Parameters:
splitTk - The token that is used for splitting the String.
Returns:
The current instance.
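
The difference between setSplitToken and withSplitToken is the usual "set and return this" pattern, which allows configuration calls to be chained. The following minimal sketch shows the pattern on a hypothetical stand-in class; SplitTokenSketch is not part of DuDe, and its default split token of a single space is an assumption, since this page does not state the real default.

```java
// Hypothetical stand-in for the split-token accessors; not part of DuDe.
final class SplitTokenSketch {

    // Assumed default; the actual default split token is not documented here.
    private String splitToken = " ";

    String getSplitToken() {
        return splitToken;
    }

    void setSplitToken(String splitTk) {
        this.splitToken = splitTk;
    }

    /** Sets the split token and returns this instance so calls can be chained. */
    SplitTokenSketch withSplitToken(String splitTk) {
        setSplitToken(splitTk);
        return this;
    }
}
```

With such a method, an instance can be created and configured in a single expression, for example new SplitTokenSketch().withSplitToken(",").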

compareAtomicValues

protected double compareAtomicValues(JsonAtomic value1,
                                     JsonAtomic value2)
Description copied from class: ContentBasedSimilarityFunction
Calculates the similarity of the two passed JsonAtomics.

Specified by:
compareAtomicValues in class ContentBasedSimilarityFunction<TFIDFSimilarityFunction>
Parameters:
value1 - The first atomic value.
value2 - The second atomic value.
Returns:
The similarity of the two passed values.

toString

public String toString()
Overrides:
toString in class Object

getSimilarity

public double getSimilarity(String str1,
                            String str2)
Description copied from interface: StringSimilarity
Returns the similarity of the passed Strings, where 0.0 means that the Strings are completely different and 1.0 means that they are the same.

Specified by:
getSimilarity in interface StringSimilarity
Parameters:
str1 - The first String.
str2 - The second String.
Returns:
The similarity of the passed Strings, or 0.0 if null was passed for at least one String.


Copyright © 2011 Hasso Plattner Institute - Chair of Information Systems. All Rights Reserved.