Match beginning of two strings

John Machin sjmachin at lexicon.net
Sat Aug 2 23:18:27 EDT 2003


On Sat, 02 Aug 2003 21:18:32 -0700, Scott David Daniels
<Scott.Daniels at Acm.Org> wrote:

>Ravi wrote:
>> Hi,
>> 
>> I have about 200GB of data that I need to go through and extract the 
>> common first part of a line. Something like this.
>> 
>>  >>>a = "abcdefghijklmnopqrstuvwxyz"
>>  >>>b = "abcdefghijklmnopBHLHT"
>>  >>>c = extract(a,b)
>>  >>>print c
>> "abcdefghijklmnop"
>> 
>> Here I want to extract the common string "abcdefghijklmnop". Basically I 
>> need a fast way to do that for any two given strings. For my situation, 
>> the common string will always be at the beginning of both strings. I can 
>> use regular expressions to do this, but from what I understand there is 
>> a lot of overhead. New data is being generated at the rate of about 1GB 
>> per hour, so this needs to be reasonably fast while leaving CPU time for 
>> other processes.
>> 
>> Thanks
>> Ravi
>> 
>While you can be forgiven for not have guessed, os.path is the place to
>look:
>	import os.path
>	a = "abcdefghijklmnopqrstuvwxyz"
>	b = "abcdefghijklmnopBHLHT"
>	print os.path.commonprefix([a,b])
>

I don't think that os.path.commonprefix was designed with 200Gb of
data in mind. Inspection of Lib/*path.py gives one the impression that
it spends a lot of time discovering that the first element is a prefix
of itself.

Ravi,  You may need to drop down to C to get the speed you want for
your requirement to find the longest common prefix of two strings. Two
things puzzling me: (1) how you would do this with regular expressions
(2) you have 200Gb now, new data arriving at the rate of 1Gb per hour,
after a year you have almost 9000Gb; where are you putting it all?
BTW, I do hope your algorithm is not O(N**2) ...

Cheers,
John





More information about the Python-list mailing list