Efficient, built-in way to determine if string has non-ASCII chars outside ASCII 32-127, CRLF, Tab?
Terry Reedy
tjreedy at udel.edu
Mon Oct 31 20:44:45 EDT 2011
On 10/31/2011 7:02 PM, Steven D'Aprano wrote:
> On Mon, 31 Oct 2011 17:47:06 -0400, Dave Angel wrote:
>
>> On 10/31/2011 03:54 PM, python at bdurham.com wrote:
>>> Wondering if there's a fast/efficient built-in way to determine if a
>>> string has non-ASCII chars outside the range ASCII 32-127, CR, LF, or
>>> Tab?
>>>
>>> I know I can look at the chars of a string individually and compare
>>> them against a set of legal chars using standard Python code (and this
>>> works fine), but I will be working with some very large files in the
>>> 100's Gb to several Tb size range so I'd thought I'd check to see if
>>> there was a built-in in C that might handle this type of check more
>>> efficiently.
>>>
>>> Does this sound like a use case for cython or pypy?
>>>
>>> Thanks,
>>> Malcolm
>>>
>> How about doing a .replace() method call, with all those characters
>> turning into '', and then see if there's anything left?
>
>
> No offense Dave, but do you really think that making a copy of as much as
> a terabyte of data is *more* efficient than merely scanning the data and
> stopping on the first non-ASCII character you see?
>
>
> There is no way of telling whether a string includes non-ASCII characters
> without actually inspecting each character. So in the event that the
> string *is* fully ASCII text, you have to check every character, there
> can be no shortcuts. However, there is a shortcut if the string isn't
> fully ASCII text: once you've found a single non-text character, stop. So
> the absolute least amount of work you can do is:
>
> # Define legal characters:
> LEGAL = ''.join(chr(n) for n in range(32, 128)) + '\n\r\t\f'
> # everybody forgets about formfeed... \f
> # and are you sure you want to include chr(127) as a text char?
>
> def is_ascii_text(text):
>     for c in text:
>         if c not in LEGAL:
>             return False
>     return True
If text is 3.x bytes, this does not work ;-).
The OP did not specify bytes or unicode, nor the Python version.
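A minimal sketch of why the str-based check breaks on 3.x bytes (this code is mine, not from the thread): iterating a bytes object yields ints, and testing an int for membership in a str raises TypeError rather than returning False.

```python
# Same LEGAL definition as in the quoted post.
LEGAL = ''.join(chr(n) for n in range(32, 128)) + '\n\r\t\f'

def is_ascii_text(text):
    for c in text:
        if c not in LEGAL:
            return False
    return True

print(is_ascii_text('plain ascii'))  # works for str: True
try:
    is_ascii_text(b'plain ascii')    # each c is an int such as 112
except TypeError as e:
    print('bytes fail:', e)
```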
>
>
> Algorithmically, that's as efficient as possible:
This is a bit strange, since you go on to explain that it is inefficient
-- O(n*k), where n is the text length and k is the length of LEGAL --
whereas the set version below is O(n).
> there's no faster way
> of performing the test, although one implementation may be faster or
> slower than another. (PyPy is likely to be faster than CPython, for
> example.)
>
> But can we get better in Python? Yes, I think so. First off, CPython is
> optimized for local variable lookups over globals, and since you are
> looking up the global LEGAL potentially 1000000000000 times, even a 1%
> saving per lookup will help a lot. So the first step is to make a local
> reference, by adding this line just above the for loop:
>
> legal = LEGAL
>
> But we can do even better still. Each time we test for "c not in legal",
> we do a linear search of 100 characters. On average, that will mean
> comparing 50 characters for equality at best. We can do better by using a
> set or frozenset, which gives us approximately constant time lookups:
>
> legal = frozenset(LEGAL)
>
> Finally, we can try to do as much work as possible in fast C code and as
> little as necessary in relatively slow Python:
>
> def is_ascii_text(text):
>     legal = frozenset(LEGAL)
>     return all(c in legal for c in text)
>
> Since all() is guaranteed to keep short-cut semantics, that will be as
> fast as possible in Python,
A dangerous statement to make. 'c in legal' has to get hash(c) and look
that up in the hash table, possibly skipping around a bit if there are
collisions. If text is a byte string rather than unicode, a simple
lookup 'mask[c]', where mask is a 0-1 byte array, should be faster (see
my other post). On my new Pentium Win 7 machine, it is -- by about 5%.
For 100,000,000 legal bytes, a minimum of 8.69 versus 9.17 seconds.
from time import clock

legal_set = frozenset(range(32, 128))
legal_ray = 128 * b'\1'
illegal = 128 * b'\0'  # only testing legal char 'a'
text = b'a' * 100000000

print(clock())
print(all(c in legal_set for c in text), clock())    # min 9.17
t = clock()
print(all(legal_ray[c] for c in text), clock() - t)  # min 8.69
##for c in text:
##    if illegal[c]: print(False); break  # slower, about 9.7
##print(True, clock())
The explicit loop took about 9.7 seconds. It is more flexible as it
could detect the position of the first bad character, or of all of them.
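For example, such a loop could be written like this (a sketch; the names LEGAL_MASK and first_illegal are mine, and the mask here is a full 256-entry table covering bytes 32-127 plus CR, LF, Tab, and FF, rather than the 128-entry timing-only mask above):

```python
# 256-entry 0/1 mask: 1 for legal bytes, 0 otherwise.
# Note: 'n in b"\r\n\t\f"' tests an int against the byte values.
LEGAL_MASK = bytes(1 if 32 <= n < 128 or n in b'\r\n\t\f' else 0
                   for n in range(256))

def first_illegal(data):
    """Return the index of the first illegal byte, or -1 if all legal."""
    for i, c in enumerate(data):
        if not LEGAL_MASK[c]:
            return i
    return -1

print(first_illegal(b'plain text\n'))   # -1
print(first_illegal(b'bad \x00 byte'))  # 4
```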
--
Terry Jan Reedy