Efficient, built-in way to determine if string has non-ASCII chars outside ASCII 32-127, CRLF, Tab?

Terry Reedy tjreedy at udel.edu
Mon Oct 31 19:10:02 EDT 2011


On 10/31/2011 3:54 PM, python at bdurham.com wrote:
> Wondering if there's a fast/efficient built-in way to determine if a
> string has non-ASCII chars outside the range ASCII 32-127, CR, LF, or Tab?

I presume you also want to disallow the other ascii control chars?

> I know I can look at the chars of a string individually and compare them
> against a set of legal chars using standard Python code (and this works

If, by 'string', you mean a string of bytes 0-255, then I would, in 
Python 3, where bytes contain ints in [0,255], make a byte mask of 256 
0s and 1s (not '0's and '1's). Example:

mask = b'\1\0'*128
for c in b'\0\1help': print(mask[c])

1
0
1
0
1
1

In your case, use \1 for forbidden and replace the print with "if 
mask[c]: <found illegal>; break"
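
For the range in the question, such a mask might be built as follows. This is only a sketch: it assumes bytes 32-127 plus tab, LF, and CR are the legal ones, and the names ALLOWED, MASK, and first_illegal are made up for illustration.

```python
# Legal bytes, taken from the question: 32-127 plus tab, LF, CR.
# (127 is DEL; drop it from the range if control chars should be rejected.)
ALLOWED = frozenset(range(32, 128)) | {9, 10, 13}

# 256-entry mask: \1 marks a forbidden byte, \0 a legal one.
MASK = bytes(0 if i in ALLOWED else 1 for i in range(256))

def first_illegal(data):
    """Return the index of the first forbidden byte in data, or -1 if none."""
    for i, c in enumerate(data):   # in Python 3, c is an int in [0, 255]
        if MASK[c]:
            return i
    return -1
```

Returning the index rather than a bare True/False makes it easy to report where in the data the first offending byte sits.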

In 2.x, where iterating byte strings gives length 1 byte strings, you 
would need ord(c) as the index, which is much slower.

> fine), but I will be working with some very large files in the 100's Gb
> to several Tb size range so I'd thought I'd check to see if there was a
> built-in in C that might handle this type of check more efficiently.
> Does this sound like a use case for cython or pypy?

Cython should get close to C speed, especially with type hints. Make sure you
compile something like the above as Python 3 code.
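
A possible alternative that stays in plain Python but keeps the inner loop in C is bytes.translate: delete every legal byte from a chunk, and anything left over is illegal. This is a different technique from the mask lookup, sketched under the same assumption about the legal set; the chunk size and the names LEGAL, has_illegal, and file_has_illegal are hypothetical.

```python
# Legal bytes from the question: 32-127 plus tab, LF, CR.
LEGAL = bytes(range(32, 128)) + b'\t\n\r'

def has_illegal(chunk):
    """True if chunk contains any byte outside LEGAL."""
    # translate(None, LEGAL) deletes all legal bytes in C;
    # a non-empty result means something illegal was present.
    return bool(chunk.translate(None, LEGAL))

def file_has_illegal(path, chunk_size=1 << 20):
    """Scan a file in fixed-size chunks (the 1 MiB default is an assumption)."""
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:          # end of file, nothing illegal seen
                return False
            if has_illegal(chunk):
                return True
```

Reading in chunks keeps memory use bounded, which matters for files in the sizes mentioned; whether this beats a Cython loop would need to be measured on the real data.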

-- 
Terry Jan Reedy



