Efficient mechanism for str.startswith on a set.

Fredrik Lundh fredrik at pythonware.com
Tue Jan 10 09:56:47 EST 2006


Brian Cole wrote:

>I need to iterate through a file, checking whether each line
> 'startswith()' a string that belongs to a set.
>
> Normally, the most efficient way I would do this would be:
> strs=set(['foo','bar'])
> for line in file:
>    if line.strip() in strs:
>        print line
>
> However, for this case I need to do a startswith like this:
> for line in file:
>     for s in strs:
>         if line.startswith(s):
>             print line
>
> This is obviously the least efficient manner to do this as I'll always
> be iterating over the entire 'strs'. I know I could make a binary tree
> out of 'strs' but that's a little more work that don't have time to do
> today. I know there should be something out there if I just ask nicely
> enough.

the RE approach used in this MultiReplace class can be quite efficient:

    http://effbot.org/zone/python-replace.htm

e.g.

        strs.sort() # lexical order
        strs.reverse() # use longest match first
        pattern = string.join(map(re.escape, strs), "|")
        match = re.compile(pattern).match

        for line in file:
            if match(line):
                print line

a less general solution would be to check line[:x] against a dictionary, where
x is max(len(s) for s in strs), and use a fallback test if the test fails (which up-
dates the dictionary).  this only works well if x is relatively small, the files are
relatively regular and not too large, and most lines match.

</F> 






More information about the Python-list mailing list