a little parsing challenge ☺

Tue Jul 19 00:30:20 EDT 2011

Billy Mays wrote:

> TL;DR version: international character sets are a problem, and Unicode
> is not the answer to that problem).

Shorter version: FUD.

Yes, having a rich and varied character set requires work. Yes, the Unicode
standard itself, and any interface to it (including Python's) are imperfect
(like anything created by fallible humans). But your post is a long and
tedious list of FUD with not one bit of useful advice.

I'm not going to go through the whole post -- life is too short. But here
are two especially egregious example showing that you have some fundamental
misapprehensions about what Unicode actually is:

> Python doesn't do Unicode exception handling correctly. (but I
> suspect that its a broader problem with languages) A good example of
> this is with UTF-8 where there are invalid code points ( such as 0xC0,
> 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already knew that, as
> well as everyone else who wants to use strings for some reason).

and then later:

> Another (this must have been a good laugh amongst the UniDevs) 'feature'
> of unicode is the zero width space (UTF-8 code point 0xE2 0x80 0x8B).

This is confused. Unicode text has code points, text which has been encoded
is nothing but bytes and not code points. "UTF-8 code point" does not even
mean anything. 

The zero width space has code point U+200B. The bytes you get depend on
which encoding you want:

>>> zws = u'\N{Zero Width Space}'
>>> zws
u'\u200b'
>>> zws.encode('utf-8')
'\xe2\x80\x8b'
>>> zws.encode('utf-16')
'\xff\xfe\x0b '

But regardless of which bytes it is encoded into, ZWS always has just a
single code point: U+200B.

You say "A good example of this is with UTF-8 where there are invalid code
points ( such as 0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF" but I don't
even understand why you think this is a problem with Unicode. 

0xC0 is not a code point, it is a byte. Not all combinations of bytes are
legal in all files. If you have byte 0xC0 in a file, it cannot be an ASCII
file: there is no ASCII character represented by byte 0xC0, because hex
0xCO = 192, which is larger than 127.

Likewise, if you have a 0xC0 byte in a file, it cannot be UTF-8. It is as
simple as that. Trying to treat it as UTF-8 will give an error, just as
trying to view a mp3 file as if it were a jpeg will give an error. Why you
imagine this is a problem for Unicode is beyond me.

-- 
Steven