matching strings in a large set of strings

Christian Heimes lists at cheimes.de
Fri Apr 30 05:10:03 EDT 2010


>>>> s = "12345678901234"
>>>> assert len(s) == 14
>>>> import sys
>>>> sys.getsizeof(s)
> 38
>
> So a single 14 char string takes 38 bytes.

Make that at least 40 bytes. You have to take memory alignment into account.
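A quick check (a minimal sketch; the 38-byte figure is from the 32-bit
CPython build quoted above, and the 8-byte allocator alignment is an
assumption):

    import sys

    s = "12345678901234"
    raw = sys.getsizeof(s)          # 38 on the build quoted above
    aligned = (raw + 7) // 8 * 8    # round up to an assumed 8-byte boundary
    print("%d raw, %d aligned" % (raw, aligned))   # 38 raw, 40 aligned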

> So a set with 83000 such strings takes approximately 1 MB. So far fairly
> trivial. But that's just the memory used by the container (the set), not
> the contents. 38 bytes * 83,000 strings = another 3 MB. Which of course
> is trivial for a modern PC, but the OP is using 83 million such strings,
> not 83 thousand, which gives us a grand total of at least 3 gigabytes. An
> entry level desktop PC these days is generally 2GB, and entry level
> notebooks might have half a gig.

You are pretty much screwed on a 32-bit system here. In my experience a 
32-bit system can't store more than 2.5 to 2.8 GB on the heap. Eventually 
malloc() will fail, since large parts of the 4 GB address space are 
reserved for other things like the stack, the entry point, shared library 
mappings, error detection etc. Memory fragmentation isn't an issue here.

Other ideas:

* Use a SQL database with an index on the data column. The index can 
optimize the "starting with" case (see the sqlite3 sketch after this 
list).
* You need to search for a string inside a large set of texts? Sounds 
like a job for a full-text search engine! Grab PyLucene and index your 
data in a Lucene index. An SSD helps a lot here.
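
Here is a minimal sqlite3 sketch of the first idea (sqlite3 is in the 
stdlib; the table and column names are made up for the demo, and the 
case_sensitive_like pragma is what lets a LIKE prefix query use the 
index):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA case_sensitive_like = ON")
    conn.execute("CREATE TABLE strings (data TEXT)")
    conn.execute("CREATE INDEX idx_data ON strings (data)")
    conn.executemany("INSERT INTO strings VALUES (?)",
                     [("12345678901234",), ("12345999",), ("99999",)])

    # "starting with" as an index-friendly prefix query
    for (value,) in conn.execute(
            "SELECT data FROM strings WHERE data LIKE ?", ("12345%",)):
        print(value)

With an on-disk database the index is queried from disk, so the 83 
million strings never have to fit in RAM at once.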

Christian



