Efficient, built-in way to determine if string has non-ASCII chars outside ASCII 32-127, CRLF, Tab?

Tue Nov 1 05:21:14 EDT 2011

Steven D'Aprano wrote:

> On Mon, 31 Oct 2011 22:12:26 -0400, Dave Angel wrote:
> 
>> I would claim that a well-written (in C) translate function, without
>> using the delete option, should be much quicker than any python loop,
>> even if it does copy the data.
> 
> I think you are selling short the speed of the Python interpreter. Even
> for short strings, it's faster to iterate over a string in Python 3 than
> to copy it with translate:
> 
> >>> from timeit import Timer
> >>> t1 = Timer('for c in text: pass', 'text = "abcd"')
> >>> t2 = Timer('text.translate(mapping)',
> ...     'text = "abcd"; mapping = "".maketrans("", "")')
> >>> min(t1.repeat())
> 0.450606107711792
> >>> min(t2.repeat())
> 0.9279451370239258

Lies, damn lies, and benchmarks ;)

Copying is fast:

>>> Timer("text + 'x'", "text='abcde '*10**6").timeit(100)
1.819761037826538
>>> Timer("for c in text: pass", "text='abcde '*10**6").timeit(100)
18.89239192008972

The problem with str.translate() (unicode.translate() in 2.x) is that it
needs a dictionary lookup for every character. However, if, like the OP, you
are going to read data from a file to check whether it's (a subset of)
ASCII, there's no point converting to a string. For bytes, translate() can
use a lookup table with the byte value as an index into that table, and the
numbers look quite different:

>>> t1 = Timer("for c in text: pass", "text = b'abcd '*10**6")
>>> t1.timeit(100)
15.818882942199707
>>> t2 = Timer("text.translate(mapping)",
...     "text = b'abcd '*10**6; mapping = b''.maketrans(b'', b'')")
>>> t2.timeit(100)
2.821769952774048