Re: a little parsing challenge ☺

rusi rustompmody at gmail.com
Mon Jul 18 22:50:21 EDT 2011


On Jul 19, 7:07 am, Billy Mays <no... at nohow.com> wrote:
> On 7/18/2011 7:56 PM, Steven D'Aprano wrote:
>
>
>
> > Billy Mays wrote:
>
> >> On 07/17/2011 03:47 AM, Xah Lee wrote:
> >>> 2011-07-16
>
> >> I gave it a shot.  It doesn't do any of the Unicode delims, because
> >> let's face it, Unicode is for goobers.
>
> > Goobers... that would be one of those new-fangled slang terms that the young
> > kids today use to mean its opposite, like "bad", "wicked" and "sick",
> > correct?
>
> > I mention it only because some people might mistakenly interpret your words
> > as a childish and feeble insult against the 98% of the world who want or
> > need more than the 127 characters of ASCII, rather than understand you
> > meant it as a sign of the utmost respect for the richness and diversity of
> > human beings and their languages, cultures, maths and sciences.
>
> (TL;DR version: international character sets are a problem, and Unicode
> is not the answer to that problem).
>
> As long as I have used python (which I admit has only been 3 years)
> Unicode has never appeared to be implemented correctly.  I'm probably
> repeating old arguments here, but whatever.
>
> Unicode is a mess.  When someone says ASCII, you know that they can only
> mean characters 0-127.  When someone says Unicode, do they mean real
> Unicode (and is it 2 byte or 4 byte?) or UTF-32 or UTF-16 or UTF-8?
> When using the 'u' datatype with the array module, the docs don't even
> tell you if it's 2 bytes wide or 4 bytes.  Which is it?  I'm sure that
> all of these can be figured out, but the problem is now I have to
> ask every one of these questions whenever I want to use strings.
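
For what it's worth, the 2-byte/4-byte question is mostly about the encoding rather than the character. A minimal sketch, assuming Python 3 (the sample character is just my own pick), showing one code point coming out at different byte lengths:

```python
# One code point, three encodings, three byte strings.
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

utf8 = ch.encode("utf-8")       # variable width: 1 to 4 bytes per code point
utf16 = ch.encode("utf-16-le")  # 2 or 4 bytes per code point
utf32 = ch.encode("utf-32-le")  # always 4 bytes per code point

for name, data in [("utf-8", utf8), ("utf-16-le", utf16), ("utf-32-le", utf32)]:
    print(name, len(data), data)
```

The string itself has length 1 in every case; only the encoded form changes size.
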
>
> Secondly, Python doesn't do Unicode exception handling correctly (but I
> suspect that it's a broader problem with languages).  A good example of
> this is with UTF-8, where there are invalid bytes (such as 0xC0,
> 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF, but you already knew that, as
> well as everyone else who wants to use strings for some reason).
>
> When embedding Python in a long running application where user input is
> received, it is very easy to make a mistake that brings down the whole
> program.  If any user string isn't properly try/excepted, a user could
> craft a malformed string which a UTF-8 decoder would choke on.  Using
> ASCII (or whatever 8 bit encoding) doesn't have these problems since all
> codepoints are valid.
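
One way to keep a malformed string from taking the program down is to funnel all untrusted input through a single decoding helper. A rough sketch, assuming Python 3; `safe_decode` is a name I made up for illustration, not anything from the standard library:

```python
def safe_decode(raw: bytes) -> str:
    """Decode untrusted bytes as UTF-8 without letting an exception escape."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to substituting U+FFFD for whatever cannot be decoded.
        return raw.decode("utf-8", errors="replace")

good = safe_decode("caf\u00e9".encode("utf-8"))
bad = safe_decode(b"abc\xc0\xafdef")  # 0xC0 never appears in valid UTF-8
print(repr(good), repr(bad))
```

Whether replacing bad bytes is acceptable or the input should be rejected outright depends on the application; the point is only that the exception is handled in one place.
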
>
> Another (this must have been a good laugh amongst the UniDevs) 'feature'
> of Unicode is the zero width space (U+200B, UTF-8 bytes 0xE2 0x80 0x8B).
> Any string can masquerade as any other string by placing a few of these
> in a string.  Any word filters you might have are now defeated by some
> cheesy Unicode nonsense character.  Can you just check for these
> characters and strip them out?  Yes.  Should you have to?  I would say no.
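
The "check for these characters and strip them out" step might look something like the following. This is only a sketch, assuming Python 3; `ZERO_WIDTH` and `strip_zero_width` are hypothetical names, and the set of characters is illustrative rather than exhaustive:

```python
# Zero-width characters to drop before filtering (illustrative, not exhaustive).
ZERO_WIDTH = {
    0x200B: None,  # ZERO WIDTH SPACE
    0x200C: None,  # ZERO WIDTH NON-JOINER
    0x200D: None,  # ZERO WIDTH JOINER
    0xFEFF: None,  # ZERO WIDTH NO-BREAK SPACE (also used as the BOM)
}

def strip_zero_width(text: str) -> str:
    # str.translate deletes every code point that maps to None.
    return text.translate(ZERO_WIDTH)

disguised = "sp\u200bam"
print(disguised == "spam", strip_zero_width(disguised) == "spam")
```

Note that the joiner and non-joiner are meaningful in some scripts, so stripping them is safer for a filter's comparison key than for the text that gets stored or displayed.
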
>
> Does it get better?  Of course!  International character sets used for
> domain name encoding use yet a different scheme (Punycode).  Are the
> following two domain names the same: tést.com , xn--tst-bma.com ?  Who
> knows!
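
The two names can at least be compared mechanically. A small sketch, assuming Python 3 and its built-in "idna" codec (which follows the older IDNA 2003 rules); the variable names are mine:

```python
unicode_name = "t\u00e9st.com"   # the accented form from the post
ascii_name = "xn--tst-bma.com"   # the Punycode form from the post

# Encode the Unicode form to its ASCII-compatible form, and decode the other way.
encoded = unicode_name.encode("idna")
decoded = ascii_name.encode("ascii").decode("idna")
print(encoded, decoded, encoded == ascii_name.encode("ascii"))
```

So under that codec the two spellings do refer to the same domain name.
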
>
> I suppose I can gloss over the pains of using Unicode in C with every
> string needing to be an LPS since 0x00 is now a valid code point in
> UTF-8 (0x0000 for 2 byte Unicode) or suffer the O(n) look up time to do
> strlen or concatenation operations.
>
> Can it get even better?  Yep.  We also now need to have a Byte order
> Mark (BOM) to determine the endianness of our characters.  Are they
> little endian or big endian?  (or perhaps one of the two possible middle
> endian encodings?)  Who knows?  String processing with unicode is
> unpleasant to say the least.  I suppose that's what we get when
> things are designed by committee.
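
For the endianness question, here is a short sketch, assuming Python 3, of what the BOM looks like on the wire for UTF-16 and how the generic "utf-16" codec uses it (the names `le` and `be` are just for illustration):

```python
import codecs

# The same letter "A" in both byte orders, each prefixed with its BOM.
le = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")
be = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")
print(le, be)

# The plain "utf-16" codec reads the BOM, picks the byte order, and drops it.
print(le.decode("utf-16"), be.decode("utf-16"))
```

UTF-8 has no byte-order problem at all, so the BOM question only comes up for the 2-byte and 4-byte encodings.
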
>
> But Hey!  The great thing about standards is that there are so many to
> choose from.
>
> --
> Bill

Thanks for writing that.
Every time I try to understand Unicode and remain stuck, I come to the
conclusion that I must be an imbecile.
Seeing others (probably more intelligent than yours truly) struggle with
it too gives me some solace!

[And I am writing this from India where there are dozens of languages,
almost as many scripts and everyone speaks and writes at least a
couple of non-European ones]


