Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
  
 

Repeatability: Zeroth-frequency moment (Cardinality) Estimation

This is a repeatability page for cardinality estimation algorithms. The algorithms are provided in the state their results have been published, but they may not represent the most recent version of their implementations. To get the more up-to-date version of the algorithms, use the binaries provided here.

Algorithms

The number of distinct values of a multiset is usually called the cardinality of this multiset. It is also known as the zeroth-frequency moment of a dataset. 

The following twelve algorithms are the most popular and well-know cardinality estimation algorithms:

Flajolet and Martin (FM)  

1985          jar          code

Probabilistic counting with stochastic averaging(PCSA)

1985jarcode

Linear Counting (LC

1990jarcode

Alon, Martias and Szegedy (AMS)

1996jarcode

Baryossef, Jayram, Kumar, Sivakumar and Trevisan(BJKST)           

2002jarcode

LogLog

2003jarcode

SuperLogLog

2003jarcode

MinCount 

2005jarcode

AKMV

2007jarcode

HyperLogLog 

2008jarcode

Bloom Filters

2010jarcode

HyperLogLog++

2013jarcode

 As a baseline we used a hash table          jar              code    

Datasets

All algorithms have been exhaustively tested on the following datasets:

Dataset#Attributes           #Tuples

NCVoter

25 (of 71) 7,560,886

Openadresses-Europe            

1193,849,474


As well as 90 synthetic datasets were generated by the Mersenne Twister random number generator.