Efficient, built-in way to determine if string has non-ASCII chars outside ASCII 32-127, CRLF, Tab?
Tim Chase
python.list at tim.thechases.com
Mon Oct 31 20:01:14 EDT 2011
On 10/31/11 18:02, Steven D'Aprano wrote:
> # Define legal characters:
> LEGAL = ''.join(chr(n) for n in range(32, 128)) + '\n\r\t\f'
> # everybody forgets about formfeed... \f
> # and are you sure you want to include chr(127) as a text char?
>
> def is_ascii_text(text):
>     for c in text:
>         if c not in LEGAL:
>             return False
>     return True
>
>
> Algorithmically, that's as efficient as possible: there's no faster way
> of performing the test, although one implementation may be faster or
> slower than another. (PyPy is likely to be faster than CPython, for
> example.)
Additionally, if one has some foreknowledge of the character
distribution, one might be able to tweak your
> def is_ascii_text(text):
>     legal = frozenset(LEGAL)
>     return all(c in legal for c in text)
with a chain of comparisons that might be faster than the hashing
involved in a set lookup (emphasis on the *might*, as I'm no
expert on CPython internals), such as
def is_ascii_text(text):
    return all(
        (' ' <= c <= '\x7f') or  # printable range first: the common case
        c == '\n' or
        c == '\r' or
        c == '\t' or
        c == '\f'
        for c in text)
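Whether the comparison chain actually beats the set lookup is something one would have to measure. A rough sketch of such a measurement with the standard timeit module follows; the function names, the sample text, and the repeat count are made up for illustration, and the numbers will vary by interpreter and by input.

```python
import timeit

LEGAL = ''.join(chr(n) for n in range(32, 128)) + '\n\r\t\f'
_LEGAL_SET = frozenset(LEGAL)  # built once, not on every call


def is_ascii_text_set(text):
    # Hash-based membership test against the set of legal characters.
    return all(c in _LEGAL_SET for c in text)


def is_ascii_text_chain(text):
    # Comparison chain; the printable range is tested first because
    # it is assumed to be the most common case.
    return all(
        (' ' <= c <= '\x7f') or
        c == '\n' or
        c == '\r' or
        c == '\t' or
        c == '\f'
        for c in text)


if __name__ == '__main__':
    # Hypothetical sample: mostly printable text with a few CR/LF/Tab.
    sample = 'The quick brown fox\r\n\tjumps over the lazy dog.\n' * 1000
    for fn in (is_ascii_text_set, is_ascii_text_chain):
        seconds = timeit.timeit(lambda: fn(sample), number=20)
        print(fn.__name__, round(seconds, 4))
```

A sample with a different character mix (say, mostly tabs and newlines) could easily reverse the ranking, which is why the foreknowledge of the distribution matters.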
But Steven's main points are all spot on: (1) use an O(1) lookup;
(2) return at the first sign of trouble; and (3) push the loop into
the C implementation rather than a Python for-loop. (And the "locals
are faster in CPython" tip is something I didn't know.)
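For what it's worth, another way to get all three properties might be a precompiled regular expression: the scan runs inside the re module's C code and search() stops at the first offending character. This is only a sketch, not something from Steven's post; the name _ILLEGAL is invented here, and the character class is assumed to mirror Steven's LEGAL string (32-127 plus \n, \r, \t, \f).

```python
import re

# Matches any single character outside the assumed legal set:
# ASCII 32-127 plus newline, carriage return, tab, and formfeed.
_ILLEGAL = re.compile(r'[^\x20-\x7f\n\r\t\f]')


def is_ascii_text(text):
    # search() returns None when no illegal character is found.
    return _ILLEGAL.search(text) is None
```

Whether it is any faster than the all()-based versions would still need timing on real data.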
-tkc