[Python-ideas] isascii()/islatin1()/isbmp()

Steven D'Aprano steve at pearwood.info
Sun Jul 1 02:59:23 CEST 2012


Serhiy Storchaka wrote:
> As shown in issue #15016 [1], there are use cases where it is useful to
> determine whether a string can be encoded in ASCII or Latin1. When working
> with Tk or Windows console applications, it can be useful to determine
> whether a string can be encoded in UCS2. The C API provides an interface
> for this, but it is not available at the Python level.
> 
> I propose adding new methods to the string class: isascii(), islatin1()
> and isbmp() (in addition to such methods as isalpha() or isdigit()). The
> implementation will be trivial.
> 
> Pro: The current trick of trying to encode has O(n) complexity and has
> the overhead of raising and catching an exception.

Are you suggesting that isascii and friends would be *better* than O(n)? How 
can that work -- wouldn't it have to scan the string and look at each character?
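
For concreteness, here is a minimal sketch of the encode-and-catch trick 
next to an explicit per-character scan. Both function names are 
hypothetical helpers made up for illustration, not existing str methods. 
In the worst case each one has to look at every character, so as pure 
Python both are O(n):

```python
def is_ascii_by_encoding(s):
    """The 'try to encode' trick: True if s can be encoded as ASCII."""
    try:
        s.encode('ascii')
    except UnicodeEncodeError:
        return False
    return True


def is_ascii_by_scan(s):
    """Explicit scan: True if every code point is below 128."""
    # all() stops at the first failing character, but a string that
    # passes still has to be scanned from start to end.
    return all(ord(ch) < 128 for ch in s)
```

The two agree on every input; the difference is only in constant 
factors and in whether an exception is raised along the way.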

Why just ASCII, Latin1 and BMP (whatever that is; googling has not turned 
up anything relevant)? It seems to me that adding these three tests will 
open the door to a steady stream of requests for new methods named 
is<insert encoding name here>.

I suggest that a better API would be a method that takes the name of an 
encoding (perhaps defaulting to 'ascii') and returns True|False:

    string.encodable(encoding='ascii') -> True|False

        Return True if string can be encoded using the named encoding,
        otherwise False.
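
A rough sketch of that behaviour, written as a plain function because 
str has no such method; the name encodable and its signature are only 
the suggestion above, not an existing API. It assumes the simplest 
possible implementation, which is just the encode-and-catch trick 
wrapped up:

```python
def encodable(string, encoding='ascii'):
    """Return True if string can be encoded using the named encoding,
    otherwise False.

    Hypothetical helper sketching the proposed str.encodable() method.
    """
    try:
        string.encode(encoding)
    except UnicodeEncodeError:
        return False
    # An unknown encoding name raises LookupError, which is deliberately
    # not caught here: a misspelled codec is a bug, not a "no".
    return True
```

With this, encodable('spam') is True, encodable('caf\xe9') is False, and 
encodable('caf\xe9', 'latin-1') is True.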


One last pedantic issue: strings aren't ASCII or Latin1, etc., but Unicode. 
There is enough confusion between Unicode text strings and bytes without 
adding methods whose names blur the distinction slightly.




-- 
Steven



