Faster 'if char in string' test

Fri Jun 25 04:29:43 EDT 2004

Kamilche wrote:

> Heheh, I'm a Python newbie. I can't even understand the code you have
> written, but I'll try.

If I don't understand Python code, I tend to blame the code. I'll try to
either explain or rewrite the "dark" pieces below.

> You know, as I was writing stuff like fn(**args.__dict__) yesterday
> (which for me is pretty advanced!), it occurred to me that Python is
> not 'simpler than C', like I had originally thought. :-D It's better
> in many ways, but it has its dark alleyways where a novice can get
> stabbed in the back, too.

Python and C are both simple. Python has friendlier error messages and less
bloat, i. e. manual garbage collection in programs that actually work.
Recent additions to Python and the newstyle/classic class schism, while all
reasonable in themselves, make it harder to "know it all". If decorators
take off, this will become even worse.
Enough pessimism - Python is still much more compact than C++, and if anyone
has the power to say no, then it's Guido van Rossum. If a way to integrate
type inferencing smoothly is actually found, why should anyone want to
program in C at all?

Peter

Now the revised code:

import itertools, re, string

_validchars = string.ascii_letters + string.digits + \
              "!@#$%^&*()`~-_=+[{]}\\|;:\'\",<.>/?\t "
_allchars = string.maketrans("", "")
_invalidchars = _allchars.translate(_allchars, _validchars)

# repeat(v) gives an "endless" sequence of v, so
# zip("abc", repeat(None)) == [("a", None), ("b", None), ("c", None)]
# In Python 2.4 I would use a set but in 2.3.x dicts are faster.
# _invalidset = set(_invalidcars)
_invaliddict = dict(zip(_invalidchars, itertools.repeat(None)))

# contains only valid characters
valid = "This is a string to test for invalid characters."

# contains both valid and invalid characters
invalid = valid + _invalidchars + valid

# A LONG string containing MANY invalid characters to make
# the translate() solution look bad
massaged = (_invalidchars + valid) * 100

# supposing _invalidchars == "abc" we get the regex re.compile("[abc]").
# rInvalid.search(s) would then return a non-None result whenever s
# contains an a, b, or c. The nice thing for our purpose is that it stops
# at the first match, i. e. if the first char is invalid, we don't care
# about the rest
rInvalid = re.compile("[%s]" % re.escape(_invalidchars))

def validate_list(s, invalid=_invalidchars):
    # since the set of invalid characters is used many times,
    # it is faster to have it as a local variable.
    # I don't know why, but achieving this via a default parameter
    # is even faster than a
    # invalid = _invalidchars
    # assignment at the beginning of the function
    for c in s:
        if c in invalid:
            return False
    return True

def validate_dict(s, invalid=_invaliddict):
    # exactly the same as validate_list, but the
    # c in invalid
    # is faster because invalid is a dictionary
    for c in s:
        if c in invalid:
            return False
    return True

def validate_translate(s):
    return len(s.translate(_allchars, _invalidchars)) == len(s)

def validate_regex(s):
    return not rInvalid.search(s)

def validate_noop(s, valid=valid):
    """ Not a valid implementation. Only to estimate the function
        call overhead.
    """
    return s is valid

def measure(func, data):
    """ Run the speed test for a func/data pair and print the result.

        func: name of the function to be tested
        data: name of the variable containing the sample data
    """
    print " ", data,
    setup = "from findcharspeed import %s, %s" % (func, data)
    run = "%s(%s)" % (func, data)
    # example of the above:
    # from findcharspeed import validate_list, valid
    # validate_list(valid) # this statement will be timed

    # timeit.main() expects parameters from the commandline, so we have
    # to add the appropriate switches when invoking it programmatically.
    timeit.main(["-s" + setup, run])

if __name__ == "__main__":
    import timeit

    # functions I want to time
    tests = [validate_list, validate_dict, validate_translate,
             validate_regex, validate_noop]

    # make sure the functions return the correct result
    for f in tests:
        assert f(valid)
        assert not f(invalid)

    # calculate the timings
    for f in tests:
        print f.__name__
        # test each candidate function with 3 different sample strings
        for data in ["valid", "invalid", "massaged"]:
            measure(f.__name__, data)