Match beginning of two strings

Ravi rxs141 at cwru.edu
Sat Aug 2 23:46:26 EDT 2003


> I don't think that os.path.commonprefix was designed with 200Gb of
> data in mind. Inspection of Lib/*path.py gives one the impression that
> it spends a lot of time discovering that the first element is a prefix
> of itself.
> 
> Ravi,  You may need to drop down to C to get the speed you want for
> your requirement to find the longest common prefix of two strings. Two
> things puzzling me: (1) how you would do this with regular expressions
> (2) you have 200Gb now, new data arriving at the rate of 1Gb per hour,
> after a year you have almost 9000Gb; where are you putting it all?
> BTW, I do hope your algorithm is not O(N**2) ...
> 
> Cheers,
> John
> 

Well, I timed os.path.commonprefix with some typical data and it's 
pulling about 40usec per loop. So I did what I hated and coded a little 
function in C. Goes something like this. My reasoning is that usually 
the point where the strings start to differ is in the 30 - 50 character 
range. Basically the idea is the same as a binary search on a sorted 
array. Divide and conquer by going halfway each time.

Read in both strings.
Check to see if the first character matches.
If yes:
     Check halfway through the string and see if that character matches
     Repeatedly check halfway until the difference point is found.
     Go back through from the difference point backwards and make sure 
the characters match from the start to the difference point.

I timed it, and it seems to be doing about 3.5usec per loop. Using 
pipes, I can feed it directly into the processing program. Good enough 
for me.

As for the data, it's data from a radio telescope that's being recorded. 
We do pattern analysis and reduce it to strings. By examining these 
strings (more analysis than just the first common bit of course), we can 
determine what data should be looked at further and what data is 
garbage. The 9000GB problem isn't all that bad, the stuff compresses 
extremely well, down to about 700GB for a year's worth. A couple of RAID 
arrays makes quick work of that.

Thanks,

Ravi





More information about the Python-list mailing list