Match beginning of two strings
John Machin
sjmachin at lexicon.net
Sat Aug 2 23:18:27 EDT 2003
On Sat, 02 Aug 2003 21:18:32 -0700, Scott David Daniels
<Scott.Daniels at Acm.Org> wrote:
>Ravi wrote:
>> Hi,
>>
>> I have about 200GB of data that I need to go through and extract the
>> common first part of a line. Something like this.
>>
>> >>>a = "abcdefghijklmnopqrstuvwxyz"
>> >>>b = "abcdefghijklmnopBHLHT"
>> >>>c = extract(a,b)
>> >>>print c
>> "abcdefghijklmnop"
>>
>> Here I want to extract the common string "abcdefghijklmnop". Basically I
>> need a fast way to do that for any two given strings. For my situation,
>> the common string will always be at the beginning of both strings. I can
>> use regular expressions to do this, but from what I understand there is
>> a lot of overhead. New data is being generated at the rate of about 1GB
>> per hour, so this needs to be reasonably fast while leaving CPU time for
>> other processes.
>>
>> Thanks
>> Ravi
>>
>While you can be forgiven for not have guessed, os.path is the place to
>look:
> import os.path
> a = "abcdefghijklmnopqrstuvwxyz"
> b = "abcdefghijklmnopBHLHT"
> print os.path.commonprefix([a,b])
>
I don't think that os.path.commonprefix was designed with 200Gb of
data in mind. Inspection of Lib/*path.py gives one the impression that
it spends a lot of time discovering that the first element is a prefix
of itself.
Ravi, You may need to drop down to C to get the speed you want for
your requirement to find the longest common prefix of two strings. Two
things puzzling me: (1) how you would do this with regular expressions
(2) you have 200Gb now, new data arriving at the rate of 1Gb per hour,
after a year you have almost 9000Gb; where are you putting it all?
BTW, I do hope your algorithm is not O(N**2) ...
Cheers,
John
More information about the Python-list
mailing list