Efficient, built-in way to determine if string has non-ASCII chars outside ASCII 32-127, CRLF, Tab?

Stefan Behnel stefan_ml at behnel.de
Tue Nov 1 15:47:02 EDT 2011


python at bdurham.com, 31.10.2011 20:54:
> Wondering if there's a fast/efficient built-in way to determine
> if a string has non-ASCII chars outside the range ASCII 32-127,
> CR, LF, or Tab?
>
> I know I can look at the chars of a string individually and
> compare them against a set of legal chars using standard Python
> code (and this works fine), but I will be working with some very
> large files in the 100's Gb to several Tb size range so I'd
> thought I'd check to see if there was a built-in in C that might
> handle this type of check more efficiently.
>
> Does this sound like a use case for cython or pypy?

Cython. For data of that size, likely read from a fast local RAID drive I 
guess, you certainly don't want to read (part of) the file into a memory 
buffer, copy that buffer into a Python bytes string, and then search it 
character by character, copying each character into a new string object 
just to compare it.

Instead, you'd want to use low-level I/O to read a not-so-small part of the 
file into a memory buffer, run through it looking for unwanted characters, 
and then read the next chunk, without any further copying. The comparison 
loop could look like this, for example:

     from libc.stdlib cimport malloc, free

     cdef unsigned char current_byte
     # malloc() returns void*, so cast it to the buffer type
     cdef unsigned char* byte_buffer = <unsigned char*> malloc(BUFFER_SIZE)

     # while read chunk ...
     for current_byte in byte_buffer[:BUFFER_SIZE]:
         if current_byte < 32 or current_byte > 127:
             if current_byte not in b'\t\r\n':
                 free(byte_buffer)
                 raise ValueError()
     free(byte_buffer)

What kind of I/O API you use is up to you. You may want to use the 
functions declared in libc.stdio (ships with Cython).

Stefan



