Efficient, built-in way to determine if string has non-ASCII chars outside ASCII 32-127, CRLF, Tab?

Tim Chase python.list at tim.thechases.com
Mon Oct 31 20:01:14 EDT 2011


On 10/31/11 18:02, Steven D'Aprano wrote:
> # Define legal characters:
> LEGAL = ''.join(chr(n) for n in range(32, 128)) + '\n\r\t\f'
>      # everybody forgets about formfeed... \f
>      # and are you sure you want to include chr(127) as a text char?
>
> def is_ascii_text(text):
>      for c in text:
>          if c not in LEGAL:
>              return False
>      return True
>
>
> Algorithmically, that's as efficient as possible: there's no faster way
> of performing the test, although one implementation may be faster or
> slower than another. (PyPy is likely to be faster than CPython, for
> example.)

Additionally, if one has some foreknowledge of the character 
distribution, one might be able to tweak your

> def is_ascii_text(text):
>      legal = frozenset(LEGAL)
>      return all(c in legal for c in text)

with a chain of comparisons that tests the most likely range 
first, which might be faster than the hashing involved in a set 
lookup (emphasis on the *might*; I'm no expert on CPython 
internals), such as

   def is_ascii_text(text):
     return all(
       (' ' <= c <= '\x7f') or  # ASCII 32-127 first, as the common case
       c == '\n' or
       c == '\r' or
       c == '\t'
       for c in text)
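
Since the *might* can only be settled by measuring, here is a rough 
timing sketch using the standard timeit module. The names 
is_ascii_text_set, is_ascii_text_chain and time_both are made up for 
illustration, the chain also accepts '\r' and '\f' so that both 
versions agree with Steven's LEGAL, and the numbers will vary by 
interpreter and by input, so treat it as a starting point only.

```python
import timeit

# Same characters as Steven's LEGAL: ASCII 32-127 plus LF, CR, tab, formfeed.
LEGAL = frozenset(chr(n) for n in range(32, 128)) | frozenset('\n\r\t\f')

def is_ascii_text_set(text, legal=LEGAL):
    # Set-membership version: one hash lookup per character.
    return all(c in legal for c in text)

def is_ascii_text_chain(text):
    # Comparison-chain version: range test first, then the rarer controls.
    return all(
        (' ' <= c <= '\x7f') or c in '\n\r\t\f'
        for c in text)

def time_both(sample, number=1000):
    # Returns (set_seconds, chain_seconds) for `number` calls on `sample`.
    t_set = timeit.timeit(lambda: is_ascii_text_set(sample), number=number)
    t_chain = timeit.timeit(lambda: is_ascii_text_chain(sample), number=number)
    return t_set, t_chain

if __name__ == '__main__':
    # Hypothetical sample; substitute text with your real distribution.
    sample = 'The quick brown fox\r\n\tjumps over the lazy dog.\n' * 100
    print('set: %.4fs  chain: %.4fs' % time_both(sample))
```

Timing on text that resembles the real input matters here, since the 
chain only pays off if most characters pass the first comparison.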

But Steven's main points are all spot on: (1) use an O(1) lookup; 
(2) return at the first sign of trouble; and (3) push the loop into 
the C implementation rather than a Python for-loop. (And "locals 
are faster in CPython" is something I didn't know.)

-tkc










