Fastest way to detect a non-ASCII character in a list of strings.

Sun Oct 17 22:47:09 EDT 2010

On Mon, 18 Oct 2010 01:04:09 +0100, Rhodri James wrote:

> On Sun, 17 Oct 2010 20:59:22 +0100, Dun Peal <dunpealer at gmail.com>
> wrote:
> 
>> `all_ascii(L)` is a function that accepts a list of strings L, and
>> returns True if all of those strings contain only ASCII chars, False
>> otherwise.
>>
>> What's the fastest way to implement `all_ascii(L)`?
>>
>> My ideas so far are:
>>
>> 1. Match against a regexp with a character range: `[ -~]`
>> 2. Use s.decode('ascii')
>> 3. `return all(31< ord(c) < 127 for s in L for c in s)`
> 
> Don't call it "all_ascii" when you don't mean that; all_printable would
> be more accurate, 

Neither is accurate. all_ascii would be:

all(ord(c) <= 127 for string in L for c in string)
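
A minimal runnable sketch of that check, assuming Python 3 and a list of
str objects (the parameter name `strings` is only illustrative). In a
generator expression with two `for` clauses, the loop over the list has
to come first, so that each string is bound before its characters are
iterated:

```python
def all_ascii(strings):
    """Return True if every character of every string is in the ASCII range."""
    # Outer loop over the list, inner loop over the characters of each string.
    # all() stops at the first character whose code point exceeds 127.
    return all(ord(c) <= 127 for s in strings for c in s)
```

For example, `all_ascii(["abc", "xyz"])` is True, while a list containing
"caf\u00e9" gives False. An empty list gives True, since there is nothing
to fail the test.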

all_printable would be considerably harder. As far as I can tell, there's 
no simple way to tell if a character is printable. You can look at the 
Unicode category, given by unicodedata.category(c), and then decide 
whether or not it is printable.

(Note though that printable characters will not necessarily print, since 
the latter relies on there being a glyph available to print. Not all fonts 
include glyphs for all printable characters.)

It might be easier to just ignore control characters, and assume 
everything else is printable:

import unicodedata
all(unicodedata.category(c) != 'Cc' for string in L for c in string)
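
As a self-contained sketch of the "ignore control characters" approach
(assuming Python 3; the function name `no_control_chars` is made up for
illustration):

```python
import unicodedata

def no_control_chars(strings):
    """Return True if no character in any string has Unicode category 'Cc'."""
    # 'Cc' is the control-character category; it includes tab and newline,
    # so strings containing ordinary line breaks are rejected as well.
    return all(unicodedata.category(c) != 'Cc'
               for s in strings for c in s)
```

Note that this treats everything outside 'Cc' as printable, including
format characters and unassigned code points, so it is only a rough
approximation of "printable".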

If you limit yourself to bytes instead of strings, it's easier:

import string
all(c in string.printable for s in L for c in s)
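
The same idea can be sketched for Python 3 bytes objects. This is an
assumption about the reader's setup rather than part of the original
suggestion: in Python 3, iterating over bytes yields integers, so the
test has to be made against byte values instead of one-character
strings. The names `PRINTABLE_BYTES` and `all_printable_bytes` are
hypothetical:

```python
import string

# Byte values of every character in string.printable, computed once.
# string.printable is pure ASCII, so encoding it cannot fail.
PRINTABLE_BYTES = frozenset(string.printable.encode("ascii"))

def all_printable_bytes(items):
    """Return True if every byte of every bytes object is printable ASCII."""
    # Iterating a bytes object in Python 3 produces ints in range(256).
    return all(b in PRINTABLE_BYTES for s in items for b in s)
```

Since string.printable includes whitespace such as tab and newline,
`all_printable_bytes([b"a\tb\n"])` is True, while any NUL or high-bit
byte makes the result False.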

As for what is faster, that's what timeit and the profiler are for: 
timeit to find out which is faster, and the profiler to find out whether 
it's worth spending the time to find out which is faster.
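
A rough sketch of such a comparison with timeit, assuming Python 3.7 or
later (for `str.isascii`). The three candidate functions and the helper
`time_candidates` are illustrative names, not anything from the thread,
and the encode-based variant is the Python 3 counterpart of the
`s.decode('ascii')` idea from the original question:

```python
import timeit

def all_ascii_ord(strings):
    # Character-by-character test on code points.
    return all(ord(c) <= 127 for s in strings for c in s)

def all_ascii_encode(strings):
    # Let the ASCII codec do the scanning; it raises on the first
    # character that cannot be encoded.
    try:
        for s in strings:
            s.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True

def all_ascii_method(strings):
    # str.isascii() exists from Python 3.7 onwards.
    return all(s.isascii() for s in strings)

CANDIDATES = (all_ascii_ord, all_ascii_encode, all_ascii_method)

def time_candidates(data, number=20):
    """Return a dict mapping each candidate's name to its total time in seconds."""
    return {f.__name__: timeit.timeit(lambda: f(data), number=number)
            for f in CANDIDATES}
```

Calling `time_candidates(L)` on data that resembles the real workload
gives one number per candidate. The ranking can change with string
length and with how early a non-ASCII character tends to appear, so it
is worth timing both the all-ASCII case and a failing case.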

-- 
Steven