matching strings in a large set of strings

M.-A. Lemburg mal at egenix.com
Thu May 6 12:26:51 EDT 2010


Dennis Lee Bieber wrote:
> On Thu, 29 Apr 2010 11:38:28 +0200, "Karin Lagesen"
> <karin.lagesen at bio.uio.no> declaimed the following in comp.lang.python:
> 
>> Hello.
>>
>> I have approx 83 million strings, all 14 characters long. I need to be
>> able to take another string and find out whether this one is present
>> within the 83 million strings.
>>
> 	Is this "another string" also exactly 14 characters long?
> 
>> Now, I have tried storing these strings as a list, a set and a dictionary.
>> I know that finding things in a set and a dictionary is very much faster
>> than working with a list, so I tried those first. However, I run out of
>> memory building both the set and the dictionary, so what I seem to be left
>> with is the list, and using the in method.
>>
> 	So don't load them into memory... First use a file-based (not memory
> limited) sort routine on the 80M strings (if position in the file is
> important, make a copy with the line number on the end of each line
> before sorting -- but since you've tried using Sets which do not retain
> order... probably not a matter).
> 
> 	Then, since you know the length of the file, and the length of each
> line is fixed, you can do direct access I/O (well, since the C-stream
> style I/O doesn't do "record" access, you'll need to calculate offsets
> from the start of the file as (record# - 1) * recordlen)...
> 
> 	That lets you do a binary search on the file. Much faster than a
> linear search (linear search will average out to 41.5M read operations;
> binary should be around 10000 reads)
> 
> 	Or load the data into some relational database and let the compiled
> code do the searching -- probably faster than byte-code interpreted
> Python for the same algorithm...

... or put the data into a simple on-disc dictionary such as
what mxBeeBase implements:

    http://www.egenix.com/products/python/mxBase/mxBeeBase/

Once you have that dict set up, lookups are very fast.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, May 06 2010)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________

::: Try our new mxODBC.Connect Python Database Interface for free ! ::::


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611
               http://www.egenix.com/company/contact/



More information about the Python-list mailing list