matching strings in a large set of strings

Bryan bryanjugglercryptographer at yahoo.com
Mon May 3 11:53:18 EDT 2010


Karin Lagesen wrote:
> I have approx 83 million strings, all 14 characters long. I need to be
> able to take another string and find out whether this one is present
> within the 83 million strings.
[...]
> I run out of memory building both the set and the dictionary, so
> what I seem to be left with is the list

I agree with the recommendations to try modules that maintain the set
on disk (sqlite, dbm, shelve), but here's an in-memory solution: Hash
to about 8 million buckets, and use a compact representation for the
several strings in each bucket. That gives you most of the speed and
avoids most of the memory overhead. Python makes it easy:


class ConstLenStrSet:

    """ Keep a set of strings all of the same length.
        Support set.add() and Python's 'in' test.
    """

    def __init__(self, strlen=14, nbuckets=8000000):
        assert strlen > 0 and nbuckets > 0
        self.strlen = strlen
        self.nbuckets = nbuckets | 1
        self.table = [''] * self.nbuckets

    def _hashstr(self, s):
        return hash(s) % self.nbuckets

    def add(self, s):
        assert len(s) == self.strlen
        if s not in self:
            self.table[self._hashstr(s)] += s

    def __contains__(self, s):
        data = self.table[self._hashstr(s)]
        for i in range(0, len(data), self.strlen):
            if data[i : i + self.strlen] == s:
                return True
        return False


# Rudimentary demo-test:

from random import choice
from string import letters

def rnd_str(slen=14):
    return ''.join(choice(letters) for _ in range(slen))

# Test against Python's built-in set
sref = set()
for i in range(830000):
    sref.add(rnd_str())

print('Strings generated.')

sset = sset = ConstLenStrSet(14, 80000)
for s in sref:
    sset.add(s)

print 'ConstLenStrSet loaded.'

for s in sref:
    assert s in sset

for i in range(1000):
    s = rnd_str()
    if s in sref:
        print 'That was unlucky.'
    else:
        assert s not in sset



If building the set is too slow, and you know you don't have a lot of
duplicate strings, you can use a faster insert method that doesn't
check whether the string is already in the set:


    def add_quick(self, s):
        assert len(s) == self.strlen
        self.table[self._hashstr(s)] += s


--
--Bryan



More information about the Python-list mailing list