making a valid file name...

Wed Oct 18 22:05:26 EDT 2006

On 2006-10-18, bearophileHUGS at lycos.com <bearophileHUGS at lycos.com> wrote:
> Tim Chase:
>> In practice, however, for such small strings as the given
>> whitelist, the underlying find() operation likely doesn't put a
>> blip on the radar.  If your whitelist were some huge document
>> that you were searching repeatedly, it could have worse
>> performance.  Additionally, the find() in the underlying C code
>> is likely about as bare-metal as it gets, whereas the set
>> membership aspect of things may go through some more convoluted
>> setup/teardown/hashing and spend a lot more time further from the
>> processor's op-codes.
>
> With this specific test (half good, half bad), on Py2.5, on my PC, sets
> start to be faster than the string search when the string "good" is
> about 5-6 chars long (this means sets are quite fast, I presume).
>
> from random import choice, seed
> from time import clock
>
> def main(choice=choice):
>     seed(1)
>     n = 100000
>
>     for good in ("ab", "abc", "abcdef", "abcdefgh",
>                  "abcdefghijklmnopqrstuvwxyz"):
>         poss = good + good.upper()
>         data = [choice(poss) for _ in xrange(n)] * 10
>         print "len(good) = ", len(good)
>
>         t = clock()
>         for c in data:
>             c in good
>         print round(clock()-t, 2)
>
>         t = clock()
>         sgood = set(good)
>         for c in data:
>             c in sgood
>         print round(clock()-t, 2), "\n"
>
> main()

On my Python 2.4 for Windows, the two are often still neck and neck
at len(good) = 26. The set's one-time construction cost is heavily
amortized over the 1,000,000 membership tests in each run (n = 100000,
with the list repeated 10 times). Without knowing the usage pattern,
it would be hard to choose between them.
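
To see how much the usage pattern matters, here is a rough sketch that
charges the set for its own construction and varies the number of
lookups. It assumes Python 3 and the standard timeit module rather than
the clock() calls in the quoted script; time_membership and probes are
names made up for this illustration, and the timings will differ from
machine to machine.

```python
import timeit


def time_membership(good, probes, repeat=3):
    """Return (str_seconds, set_seconds) for testing every item in probes.

    The set timing includes building the set from `good`, so the
    construction cost is only amortized when there are many probes.
    """
    def with_str():
        for c in probes:
            c in good          # substring search on the whitelist string

    def with_set():
        sgood = set(good)      # construction cost is counted on purpose
        for c in probes:
            c in sgood         # hash lookup

    # Take the best of several runs to reduce noise from other processes.
    return (min(timeit.repeat(with_str, number=1, repeat=repeat)),
            min(timeit.repeat(with_set, number=1, repeat=repeat)))


if __name__ == "__main__":
    good = "abcdefghijklmnopqrstuvwxyz"
    pattern = good + good.upper()      # half good, half bad, as in the quoted test
    for copies in (1, 100, 10000):
        probes = pattern * copies
        t_str, t_set = time_membership(good, probes)
        print("%8d lookups: str %.6f s, set %.6f s"
              % (len(probes), t_str, t_set))
```

With only a handful of lookups the set has little chance to pay back
its setup, which is why the choice depends on how often the whitelist
is consulted.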

-- 
Neil Cerutti