Efficient, built-in way to determine if string has non-ASCII chars outside ASCII 32-127, CRLF, Tab?
Terry Reedy
tjreedy at udel.edu
Mon Oct 31 20:44:45 EDT 2011
On 10/31/2011 7:02 PM, Steven D'Aprano wrote:
> On Mon, 31 Oct 2011 17:47:06 -0400, Dave Angel wrote:
>
>> On 10/31/2011 03:54 PM, python at bdurham.com wrote:
>>> Wondering if there's a fast/efficient built-in way to determine if a
>>> string has non-ASCII chars outside the range ASCII 32-127, CR, LF, or
>>> Tab?
>>>
>>> I know I can look at the chars of a string individually and compare
>>> them against a set of legal chars using standard Python code (and this
>>> works fine), but I will be working with some very large files in the
>>> 100's Gb to several Tb size range so I'd thought I'd check to see if
>>> there was a built-in in C that might handle this type of check more
>>> efficiently.
>>>
>>> Does this sound like a use case for cython or pypy?
>>>
>>> Thanks,
>>> Malcolm
>>>
>> How about doing a .replace() method call, with all those characters
>> turning into '', and then see if there's anything left?
>
>
> No offense Dave, but do you really think that making a copy of as much as
> a terabyte of data is *more* efficient than merely scanning the data and
> stopping on the first non-ASCII character you see?
>
>
> There is no way of telling whether a string includes non-ASCII characters
> without actually inspecting each character. So in the event that the
> string *is* fully ASCII text, you have to check every character, there
> can be no shortcuts. However, there is a shortcut if the string isn't
> fully ASCII text: once you've found a single non-text character, stop. So
> the absolute least amount of work you can do is:
>
> # Define legal characters:
> LEGAL = ''.join(chr(n) for n in range(32, 128)) + '\n\r\t\f'
> # everybody forgets about formfeed... \f
> # and are you sure you want to include chr(127) as a text char?
>
> def is_ascii_text(text):
>     for c in text:
>         if c not in LEGAL:
>             return False
>     return True
If text is 3.x bytes, this does not work ;-).
The OP did not specify bytes or unicode, nor the Python version.
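A minimal sketch of why the str-based check breaks on 3.x bytes (this code is mine, not from the thread): iterating a bytes object yields ints, and testing an int for membership in a str raises TypeError rather than returning False.

```python
# Same LEGAL definition as in the quoted post.
LEGAL = ''.join(chr(n) for n in range(32, 128)) + '\n\r\t\f'

def is_ascii_text(text):
    for c in text:
        if c not in LEGAL:
            return False
    return True

print(is_ascii_text('plain ascii'))  # works for str: True
try:
    is_ascii_text(b'plain ascii')    # each c is an int such as 112
except TypeError as e:
    print('bytes fail:', e)
```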
>
>
> Algorithmically, that's as efficient as possible:
This is a bit strange, since you go on to explain that it is inefficient
-- O(n*k), where n is the text length and k is the length of LEGAL --
whereas the set version below is O(n).
> there's no faster way
> of performing the test, although one implementation may be faster or
> slower than another. (PyPy is likely to be faster than CPython, for
> example.)
>
> But can we get better in Python? Yes, I think so. First off, CPython is
> optimized for local variable lookups over globals, and since you are
> looking up the global LEGAL potentially 1000000000000 times, even a 1%
> saving per lookup will help a lot. So the first step is to make a local
> reference, by adding this line just above the for loop:
>
> legal = LEGAL
>
> But we can do even better still. Each time we test for "c not in legal",
> we do a linear search of 100 characters. On average, that will mean
> comparing 50 characters for equality at best. We can do better by using a
> set or frozenset, which gives us approximately constant time lookups:
>
> legal = frozenset(LEGAL)
>
> Finally, we can try to do as much work as possible in fast C code and as
> little as necessary in relatively slow Python:
>
> def is_ascii_text(text):
>     legal = frozenset(LEGAL)
>     return all(c in legal for c in text)
>
> Since all() is guaranteed to keep short-cut semantics, that will be as
> fast as possible in Python,
A dangerous statement to make. 'c in legal' has to get hash(c) and look
that up in the hash table, possibly skipping around a bit if there are
collisions. If text is a byte string rather than unicode, a simple
lookup 'mask[c]', where mask is a 0-1 byte array, should be faster (see
my other post). On my new Pentium Win 7 machine, it is -- by about 5%.
For 100,000,000 legal bytes, a minimum of 8.69 versus 9.17 seconds.
from time import clock

legal_set = frozenset(range(32, 128))
legal_ray = 128 * b'\1'
illegal = 128 * b'\0'  # only testing legal char 'a'
text = b'a' * 100000000

print(clock())
print(all(c in legal_set for c in text), clock())    # min 9.17
t = clock()
print(all(legal_ray[c] for c in text), clock() - t)  # min 8.69
##for c in text:
##    if illegal[c]: print(False); break  # slower, about 9.7
##print(True, clock())
The explicit loop took about 9.7 seconds. It is more flexible as it
could detect the position of the first bad character, or of all of them.
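For example, such a loop could be written like this (a sketch; the names LEGAL_MASK and first_illegal are mine, and the mask here is a full 256-entry table covering bytes 32-127 plus CR, LF, Tab, and FF, rather than the 128-entry timing-only mask above):

```python
# 256-entry 0/1 mask: 1 for legal bytes, 0 otherwise.
# Note: 'n in b"\r\n\t\f"' tests an int against the byte values.
LEGAL_MASK = bytes(1 if 32 <= n < 128 or n in b'\r\n\t\f' else 0
                   for n in range(256))

def first_illegal(data):
    """Return the index of the first illegal byte, or -1 if all legal."""
    for i, c in enumerate(data):
        if not LEGAL_MASK[c]:
            return i
    return -1

print(first_illegal(b'plain text\n'))   # -1
print(first_illegal(b'bad \x00 byte'))  # 4
```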
--
Terry Jan Reedy