[Chicago] understanding unicode problems

Fri Nov 16 17:19:15 CET 2007

On Friday November 16 2007 10:57:10 am Feihong Hsu wrote:
> There's probably no good, complete answer that can be given in a short
> email post. Basically, there's supposed to be a standard encoding for
> unicode: UTF-8. However, go to google.cn for instance and you'll see that

If this isn't outright wrong, it's at least confusing.  AFAIK, there is no
single official standard encoding, though I'd be happy to be corrected.  UTF-8
has become the de facto standard because it can represent every Unicode
character, is backward compatible with ASCII, and doesn't use an absurd number
of bytes per character.  There are other encodings that cover the same
characters but are less common for files and network traffic: UTF-7, UTF-16.
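
To make the bytes-per-character point concrete, here is a rough Python 3
sketch that encodes one short string several ways and counts the bytes.  The
sample string and the helper name encoded_sizes are made up for illustration;
the "-le" codec variants are used only so that no byte order mark is added to
the counts.

```python
# Sample text: "café 中文" (7 code points), written with escapes so the
# source file itself stays plain ASCII.
text = "caf\u00e9 \u4e2d\u6587"


def encoded_sizes(s, encodings=("utf-8", "utf-16-le", "utf-32-le")):
    """Return {encoding name: number of bytes} for the string s.

    Hypothetical helper for illustration only.
    """
    return {name: len(s.encode(name)) for name in encodings}


sizes = encoded_sizes(text)
for name, size in sizes.items():
    print(name, size)

# Each of these encodings can represent the whole string, so decoding
# gives back exactly what we started with.
assert all(text.encode(name).decode(name) == text for name in sizes)
```

For this string UTF-8 comes out smallest because the ASCII letters take one
byte each; text that is mostly CJK would tip the comparison toward UTF-16.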

> So we have to encode/decode because there is no standard encoding yet.
> That's why GB2312 and all those other bizarro encodings are packed into the
> Python standard library.

As for the need for other encodings, we've got 50 years of legacy documents 
that aren't going to magically transform themselves to UTF-8.
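
Converting one of those legacy documents is mostly a decode followed by an
encode, provided you already know what encoding it was written in.  A minimal
Python 3 sketch, where the function name transcode and the sample bytes are
assumptions for illustration and not anything from the original thread:

```python
def transcode(data, source_encoding, target_encoding="utf-8"):
    """Decode bytes from a known legacy encoding and re-encode them.

    Hypothetical helper: the caller must supply the correct source
    encoding, since it cannot be read reliably from the bytes themselves.
    """
    return data.decode(source_encoding).encode(target_encoding)


# Stand-in for bytes read from an old GB2312 file (the text is "中文").
legacy = "\u4e2d\u6587".encode("gb2312")

converted = transcode(legacy, "gb2312")
print(converted.decode("utf-8") == "\u4e2d\u6587")
```

If the source encoding is guessed wrong, decode() will either raise
UnicodeDecodeError or quietly produce the wrong characters, which is why the
legacy codecs have to stay around.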

-- 
Peter Fein   ||   773-575-0694   ||   pfein at pobox.com
http://www.pobox.com/~pfein/   ||   PGP: 0xCCF6AE6B
irc: pfein at freenode.net   ||   jabber: peter.fein at gmail.com