Which is faster...find or re. ?
Hans Nowak
wurmy at earthlink.net
Fri Aug 2 21:39:50 EDT 2002
Fearless Freep wrote:
> I'm doing some parsing on HTML files and looking for particular tags.
>
> First off given a single line that I want to find a string in, would
> it be quicker to do
>
> if string.find(line, searchString) > -1:
>     # process line
>
> or
>
> result = re.compile (searchString).match(line)
> if result:
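string.find is usually faster than regular expressions. You shouldn't really
use regexen unless you're looking for a pattern rather than a substring. Note
also that match() only tests at the start of the line; the regex counterpart
of string.find is search().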
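To make the difference concrete, here is a minimal sketch in current Python, where the str.find method replaces string.find. The sample line and search string are made up for illustration, and the timings at the end are only a rough indication that will vary by machine and Python version.

```python
import re
import timeit

line = '<p>See <a href="index.html">home</a></p>'  # made-up sample line
needle = "<a "                                      # made-up search string

# Plain substring test; str.find returns -1 when nothing is found.
found_by_find = line.find(needle) > -1

# re.escape keeps the needle literal, so both approaches look for the same text.
pattern = re.compile(re.escape(needle))

# match() only succeeds at the start of the line; search() scans all of it.
found_by_match = pattern.match(line) is not None
found_by_search = pattern.search(line) is not None

print(found_by_find, found_by_match, found_by_search)

# Rough timing of the two approaches that agree with each other.
t_find = timeit.timeit(lambda: line.find(needle) > -1, number=100_000)
t_re = timeit.timeit(lambda: pattern.search(line) is not None, number=100_000)
print(f"find: {t_find:.4f}s  re.search: {t_re:.4f}s")
```

Here find and search both report a hit, while match does not, because the line starts with a different tag.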
> Now, expanding the question, which would probably be quicker.
>
> for line in file.readlines():
>     if string.find (....
>
> or
>
> fileContents = file.read()
> searchResults = re.compile (searchString).search(fileContents).
>
> and then looping over searchResults
I don't think these two code snippets do the same thing, BTW. The first loops over
all lines, and if it finds a certain string, it does something. The second
searches all data for a certain string, and may find the first occurrence, but
not others. You probably want re.findall here.
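A small sketch of the difference, using made-up data in place of the file contents: search returns a single match object for the first hit, findall returns a list of all matching substrings, and finditer gives match objects if you also need the positions.

```python
import re

data = "<b>one</b>\n<i>two</i>\n<b>three</b>\n"  # made-up file contents
pattern = re.compile(r"<b>")

first = pattern.search(data)       # one match object for the first hit, or None
all_hits = pattern.findall(data)   # list of every matching substring
positions = [m.start() for m in pattern.finditer(data)]  # offset of each hit

print(first.start(), all_hits, positions)
```

Note that the result of search is not something you can loop over; it is either one match or None.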
I think that reading the whole file and then searching the bulk is faster,
although I don't have any hard data or benchmarks to prove it. You might want
to write a little benchmark yourself to see which one is faster. My bet is that
    data = f.read()
    results = re.findall(pattern, data)
is faster. I guess you'd have to use the re module here since the string module
doesn't have a findall or something similar. Or use:
    x = string.find(data, s)
    while x > -1:
        ...do something...
        # continue after the previous hit, or the loop never ends
        x = string.find(data, s, x + 1)
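The same loop as a small reusable function, sketched for current Python with the str.find method. The name find_all is my own, not something from the standard library. The important detail is the start argument to find, which resumes the search just past the previous hit; without it the loop would keep finding the first occurrence forever.

```python
def find_all(data, s):
    """Return the start offsets of every occurrence of s in data."""
    offsets = []
    x = data.find(s)
    while x > -1:
        offsets.append(x)
        # Resume one character past the previous hit, so overlapping
        # occurrences are reported too.
        x = data.find(s, x + 1)
    return offsets


print(find_all("abcabcabc", "abc"))
```

If overlapping hits are not wanted, stepping by len(s) instead of 1 should skip them, assuming s is not empty.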
If you do use a regex, don't forget to compile it once up front and reuse the
compiled object; that is much faster than recompiling it for every line.
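For the "little benchmark" suggested above, something along these lines might do as a starting point. It is only a sketch: the data is made up, file reading is left out so only the searching is timed, and the function names are mine. The regex is compiled once, outside the timed code.

```python
import re
import timeit

# Made-up data standing in for an HTML file: 2000 lines, every tenth has a link.
lines = [('<a href="x">link</a>\n' if i % 10 == 0 else "<p>text</p>\n")
         for i in range(2000)]
data = "".join(lines)
needle = "<a "
pattern = re.compile(re.escape(needle))  # compiled once, then reused


def per_line():
    # Count lines containing the needle, one find call per line.
    return sum(1 for line in lines if line.find(needle) > -1)


def whole_file():
    # Count every occurrence in the full text with a single findall call.
    return len(pattern.findall(data))


print(per_line(), whole_file())
for fn in (per_line, whole_file):
    print(fn.__name__, timeit.timeit(fn, number=200))
```

Both functions should report the same count on this data; which one comes out faster is exactly what the timing is there to tell you, so I won't guess at the numbers.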
HTH,
--
Hans (base64.decodestring('d3VybXlAZWFydGhsaW5rLm5ldA=='))
# decode for email address ;-)
The Pythonic Quarter:: http://www.awaretek.com/nowak/