Efficient, built-in way to determine if string has non-ASCII chars outside ASCII 32-127, CRLF, Tab?

Steven D'Aprano steve+comp.lang.python at pearwood.info
Tue Nov 1 02:56:51 EDT 2011


On Mon, 31 Oct 2011 20:44:45 -0400, Terry Reedy wrote:

[...]
>> def is_ascii_text(text):
>>      for c in text:
>>          if c not in LEGAL:
>>              return False
>>      return True
> 
> If text is 3.x bytes, this does not work ;-). OP did not specify bytes
> or unicode or Python version.

The OP specified *string*. Whether that's a byte-string in Python 2, or a 
unicode string in Python 3, it will still work, so long as both `text` 
and `LEGAL` are strings.

[steve at sylar ~]$ python2 -c "print 'a' in 'abc'"  # both byte strings
True
[steve at sylar ~]$ python2 -c "print u'a' in u'abc'"  # both unicode strings
True
[steve at sylar ~]$ python3 -c "print('a' in 'abc')"  # both unicode strings
True

Mixing bytes and characters may or may not do what you expect.

[steve at sylar ~]$ python2 -c "print 'a' in u'abc'"  # one of each
True

Whether that's a good thing or a bad thing is open for debate :)
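
(Python 3 settles that debate by refusing to mix the two at all. A
quick sketch, for the record:)

# In Python 3, containment tests refuse to mix bytes and str;
# both directions raise TypeError instead of guessing an encoding.
try:
    b'a' in 'abc'
except TypeError as err:
    print('TypeError:', err)

try:
    'a' in b'abc'
except TypeError as err:
    print('TypeError:', err)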


>> Algorithmically, that's as efficient as possible:
> 
> This is a bit strange since you go on to explain that it is inefficient
> -- O(n*k) where n = text length and k = legal length -- whereas below is
> O(n).

But since k is a constant, O(nk) is O(n).

When using Big Oh notation, it is acceptable to hand-wave away a lot of 
detail. For example, it's common to assume that n+7, n//7 and n**7 all 
take the same constant amount of time, which is patently untrue: division 
and exponentiation both require more work than addition. In this case, 
relative to the size of the input text, both the linear search through 
LEGAL and the hash lookup into a set cost O(1) per character. That 
doesn't imply that they *literally* take constant time.
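
If you want to see the linearity for yourself, here's a rough timing
sketch (the exact numbers will vary by machine): doubling the input
length should roughly double the measured time, whatever the size of
LEGAL.

from timeit import timeit

LEGAL = ''.join(chr(n) for n in range(32, 128)) + '\n\r\t\f'

def is_ascii_text(text):
    for c in text:
        if c not in LEGAL:
            return False
    return True

# Each doubling of n should roughly double the measured time.
for n in (100000, 200000, 400000):
    text = 'a' * n
    print(n, timeit(lambda: is_ascii_text(text), number=10))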

[...]
>> Since all() is guaranteed to keep short-circuit semantics, that will
>> be as
>> fast as possible in Python,
> 
> A dangerous statement to make.

I like living dangerously :)

> 'c in legal' has to get hash(c) and look
> that up in the hash table, possibly skipping around a bit if there
> are collisions. If text is a byte string rather than unicode, a
> simple lookup 'mask[c]', where mask is a 0-1 byte array, should be
> faster (see my other post).

Oooh, clever! I like! It's not necessary to assume bytes, nor is it 
necessary to create a bitmask the size of the entire Unicode range. 
Here's a version for strings (bytes or unicode in Python 2, unicode in 
Python 3):

# Printable ASCII (codes 32-127) plus newline, CR, tab and form feed.
LEGAL = ''.join(chr(n) for n in range(32, 128)) + '\n\r\t\f'
# MASK[n] is '\01' if chr(n) is legal, '\0' if it is not.
MASK = ''.join('\01' if chr(n) in LEGAL else '\0' for n in range(128))

# Untested
def is_ascii_text(text):
    for c in text:
        n = ord(c)
        # Anything past the end of the mask, or flagged '\0', is illegal.
        if n >= len(MASK) or MASK[n] == '\0':
            return False
    return True


Optimizing it is left as an exercise :)
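
One possible first step in that exercise (a sketch, as untested as the
above): bind the mask into default arguments and let all() drive the
loop, which keeps the short-circuit semantics while shaving some
bytecode overhead per character.

def is_ascii_text(text, _mask=MASK, _size=len(MASK)):
    # Default arguments are bound once, at definition time, so the
    # lookups inside the loop are all local.  all() still stops at
    # the first illegal character (and in Python 3, map() is lazy).
    return all(n < _size and _mask[n] != '\0' for n in map(ord, text))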

I *suspect*, even with any optimizations, that this will be slower than 
the version using a set.

> On my
> new Pentium Win 7 machine, it is -- by about 5%. For 100,000,000 legal
> bytes, a minimum of 8.69 versus 9.17 seconds.

On my old Linux box, I get the opposite result: set lookup is a tiny bit 
faster, coincidentally also by almost 5%.


>>> from time import time
>>> legal_set = frozenset(range(32, 128))
>>> text = b'a' * 100000000
>>> t = time(); all(c in legal_set for c in text); time()-t
True
27.174341917037964
>>> 
>>> legal_ray = 128 * b'\1'
>>> t = time(); all(legal_ray[c] for c in text); time()-t
True
28.39691996574402
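
For bytes input specifically, there's a third contender worth timing:
bytes.translate() with a delete argument pushes the entire scan into C.
A sketch for Python 3 (not benchmarked here):

# Delete every legal byte; if nothing survives, the input was pure
# "ASCII text" in the OP's sense.  Python 3 bytes only.
LEGAL_BYTES = bytes(range(32, 128)) + b'\n\r\t\f'

def is_ascii_bytes(data):
    return not data.translate(None, LEGAL_BYTES)

Unlike the set and mask versions, this never short-circuits on the
first illegal byte, but the per-byte cost is tiny.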




-- 
Steven


