Efficient, built-in way to determine if a string has chars outside ASCII 32-127, CR, LF, Tab?

Steven D'Aprano steve+comp.lang.python at pearwood.info
Mon Oct 31 19:02:15 EDT 2011


On Mon, 31 Oct 2011 17:47:06 -0400, Dave Angel wrote:

> On 10/31/2011 03:54 PM, python at bdurham.com wrote:
>> Wondering if there's a fast/efficient built-in way to determine if a
>> string has chars outside the range ASCII 32-127, other than CR, LF,
>> or Tab?
>>
>> I know I can look at the chars of a string individually and compare
>> them against a set of legal chars using standard Python code (and this
>> works fine), but I will be working with some very large files in the
>> hundreds-of-GB to several-TB size range, so I thought I'd check to
>> see if there is a built-in written in C that might handle this type
>> of check more efficiently.
>>
>> Does this sound like a use case for cython or pypy?
>>
>> Thanks,
>> Malcolm
>>
> How about doing a .replace() method call, with all those characters
> turning into '', and then seeing if there's anything left?


No offense, Dave, but do you really think that making a copy of as much as 
a terabyte of data is *more* efficient than merely scanning the data and 
stopping on the first non-ASCII character you see?


There is no way of telling whether a string includes non-ASCII characters 
without actually inspecting each character. So in the event that the 
string *is* fully ASCII text, you have to check every character; there 
can be no shortcuts. However, there is a shortcut if the string isn't 
fully ASCII text: once you've found a single non-text character, stop. So 
the absolute least amount of work you can do is:

# Define legal characters:
LEGAL = ''.join(chr(n) for n in range(32, 128)) + '\n\r\t\f'
    # everybody forgets about formfeed... \f
    # and are you sure you want to include chr(127) as a text char?

def is_ascii_text(text):
    for c in text:
        if c not in LEGAL:
            return False
    return True
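
For example:

>>> is_ascii_text("Hello, world!\n")
True
>>> is_ascii_text("na\xefve")  # '\xef' is outside the legal range
False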


Algorithmically, that's as efficient as possible: there's no faster way 
of performing the test, although one implementation may be faster or 
slower than another. (PyPy is likely to be faster than CPython, for 
example.)

But can we do better in Python? Yes, I think so. First off, CPython 
looks up local variables faster than globals, and since you are looking 
up the global LEGAL potentially 1,000,000,000,000 times, even a 1% 
saving per lookup will help a lot. So the first step is to make a local 
reference by adding this line just above the for loop:

    legal = LEGAL
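
With that one-line change applied, the whole function reads:

def is_ascii_text(text):
    legal = LEGAL  # local reference: faster to look up than the global
    for c in text:
        if c not in legal:
            return False
    return True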

But we can do better still. Each time we test "c not in legal", we do a 
linear search through a string of 100 characters. Even in the best case, 
where every character is legal, that means comparing about 50 characters 
for equality on average. We can do better by using a set or frozenset, 
which gives us approximately constant-time lookups:

    legal = frozenset(LEGAL)
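
If you want to see the difference for yourself, here is a rough 
micro-benchmark sketch using the standard timeit module (the exact 
numbers will vary with your machine and Python version):

import timeit

setup = """
legal_str = ''.join(chr(n) for n in range(32, 128)) + '\\n\\r\\t\\f'
legal_set = frozenset(legal_str)
"""
# '~' is chr(126), near the end of the string, so the string test
# scans nearly all 100 characters; the frozenset test hashes once.
print(timeit.timeit("'~' in legal_str", setup=setup))
print(timeit.timeit("'~' in legal_set", setup=setup))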

Finally, we can try to do as much work as possible in fast C code and as 
little as necessary in relatively slow Python:

def is_ascii_text(text):
    legal = frozenset(LEGAL)
    return all(c in legal for c in text)

Since all() is guaranteed to have short-circuit semantics (it stops at 
the first false element), that will be as fast as possible in Python, 
and quite possibly just as fast as any C extension you might write.
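
Since the files in question are far too big to read into memory in one 
go, here is a minimal sketch of applying the check chunk by chunk. The 
function name, chunk size and Latin-1 decoding are my illustration, not 
part of the original question; Latin-1 maps every byte to the character 
with the same code point, so is_ascii_text() can be reused unchanged:

def file_is_ascii_text(path, chunk_size=1024*1024):
    # Read in binary chunks so the whole file never has to fit in
    # memory; stop at the first chunk with an illegal character.
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return True
            if not is_ascii_text(chunk.decode('latin-1')):
                return False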

If that's still too slow, use smaller files or get a faster computer :)


-- 
Steven


