27.03.2012

Paper accepted at SSDBM

Proceedings of the 24th International Conference on Scientific and Statistical Database Management, 25-27 June 2012 Chania, Crete, Greece

Efficient Similarity Search in Very Large String Sets

Dandy Fenz, Dustin Lange, Astrid Rheinländer, Felix Naumann, and Ulf Leser

Abstract. String similarity search is required by many real-life applications, such as spell checking, data cleansing, fuzzy keyword search, or comparison of DNA sequences. Given a very large string set and a query string, the string similarity search problem is to efficiently find all strings in the string set that are similar to the query string. Similarity is defined using a similarity (or distance) measure, such as edit distance or Hamming distance. In this paper, we introduce the State Set Index (SSI) as an efficient solution for this search problem.

SSI is based on a trie (prefix index) that is interpreted as a nondeterministic finite automaton. SSI implements a novel state labeling strategy making the index highly space-efficient. Furthermore, SSI's space consumption can be gracefully traded against search time.
We evaluated SSI on different sets of person names with up to 170~million strings from a social network and compared it to other state-of-the-art methods. We show that in the majority of cases, SSI is significantly faster than other tools and requires less index space.

PDF