how to avoid leading white spaces

Chris Torek nospam at torek.net
Thu Jun 2 22:58:24 EDT 2011


>In article <94ph22FrhvU5 at mid.individual.net>
> Neil Cerutti <neilc at norwich.edu> wrote:
>> Python's str methods, when they're sufficent, are usually more
>> efficient.

In article <roy-E2FA6F.21571602062011 at news.panix.com>
Roy Smith  <roy at panix.com> replied:
>I was all set to say, "prove it!" when I decided to try an experiment.  
>Much to my surprise, for at least one common case, this is indeed 
>correct.
 [big snip]
>t1 = timeit.Timer("'laoreet' in text",
>                 "text = '%s'" % text)
>t2 = timeit.Timer("pattern.search(text)",
>                  "import re; pattern = re.compile('laoreet'); text = 
>'%s'" % text)
>print t1.timeit()
>print t2.timeit()
>-------------------------------------------------
>./contains.py
>0.990975856781
>1.91417002678
>-------------------------------------------------

This is a bit surprising, since both "s1 in s2" and re.search()
could use a Boyer-Moore-based algorithm for a sufficiently-long
fixed string, and the time required should be proportional to that
needed to set up the skip table.  The re.compile() gets to re-use
the table every time.  (I suppose "in" could as well, with some sort
of cache of recently-built tables.)

Boyer-Moore search is roughly O(M/N) where M is the length of the
text being searched and N is the length of the string being sought.
(However, it depends on the form of the string, e.g., searching
for "ababa" is not as good as searching for "abcde".)

Python might be penalized by its use of Unicode here, since a
Boyer-Moore table for a full 16-bit Unicode string would need
65536 entries (one per possible ord() value).  However, if the
string being sought is all single-byte values, a 256-element
table suffices; re.compile(), at least, could scan the pattern
and choose an appropriate underlying search algorithm.

There is an interesting article here as well:
   http://effbot.org/zone/stringlib.htm
-- 
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W)  +1 801 277 2603
email: gmail (figure it out)      http://web.torek.net/torek/index.html



More information about the Python-list mailing list