Is there really a default source encoding?
Brian Quinlan
brian at sweetapp.com
Fri Jan 24 02:20:14 EST 2003
> > UTF-8 is certainly not "anglo-neutral". It is often
> > prohibitively expensive to encode Japanese and
> > Chinese text in UTF-8 (UTF-16 is much more popular).
>
> When did "anglo" come to mean "Japanese and Chinese"?
I thought that the OP meant that it was not biased towards English. My
mistake.
> UTF-16 is a horrible hack to work around Unicode failings...i.e.,
> that it started out as a 16 bit system and ended up morphing into
> ISO-10646...which UTF-16 doesn't actually solve, anyway, besides
> being more prohibitively expensive for non-CJK users than UTF-8 is
> for them.
UTF-16 is a compromise encoding. It is equally crappy for almost
everyone. It is also a lot easier to process than UTF-8 for most CJK
applications, e.g. it can often be processed as if it were UCS-2.
> [Why UTF-16, rather than UCS-2, though? Is there something in the
> UTF-16 accessible-only-by-surrogate region that CJK users should care
> about?]
Some of the new Japanese characters (e.g. dentistry symbols) are only
available through surrogates.
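
To make the surrogate point concrete, here is a minimal Python 3 sketch.
It assumes U+20B9F, a CJK Extension B ideograph outside the Basic
Multilingual Plane, as the sample character; that choice is only for
illustration and is not one of the symbols mentioned above.

```python
# Sample code point outside the BMP (an assumption chosen for illustration).
ch = "\U00020B9F"

# Encode without a BOM so the bytes are exactly the UTF-16 code units.
units = ch.encode("utf-16-be")
high = int.from_bytes(units[:2], "big")
low = int.from_bytes(units[2:], "big")

# Code that treats UTF-16 as UCS-2 sees two separate 16-bit values here,
# both of which fall in the reserved surrogate ranges.
is_surrogate_pair = 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF

# A BMP character, by contrast, fits in a single 16-bit unit.
bmp_units = "日".encode("utf-16-be")

print(len(units), hex(high), hex(low), is_surrogate_pair, len(bmp_units))
```

Text that stays inside the BMP can be handled as fixed-width 16-bit
units; anything like the sample above needs surrogate-aware code.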
> Is there such a thing as UTF-32? You mean UCS-4?
> And you said UTF-8 was "prohibitively expensive" ???
No, UTF-32 exists. For Japanese, UTF-8 requires (at minimum) 50% more
space per character than UTF-16. I was being facetious with my UTF-32
comment. But UTF-32 may become more efficient than UTF-16, for some
languages (e.g. Sanskrit), in the future.
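
The size difference between UTF-8 and UTF-16 for Japanese is easy to
check. A rough Python 3 sketch, assuming a short sample string of kanji
and kana that all live in the BMP (the string itself is just an example):

```python
# Hypothetical sample: eight Japanese characters, all inside the BMP.
text = "日本語のテキスト"

# Use the explicit-endian codecs so no BOM is counted in the sizes.
utf8_size = len(text.encode("utf-8"))
utf16_size = len(text.encode("utf-16-le"))
utf32_size = len(text.encode("utf-32-le"))

# Ratio of UTF-8 size to UTF-16 size for this sample.
ratio = utf8_size / utf16_size

print(len(text), utf8_size, utf16_size, utf32_size, ratio)
```

Each of these characters takes three bytes in UTF-8, two in UTF-16 and
four in UTF-32, which is where the 50% figure comes from. Mixed text
with a lot of ASCII markup would shift the ratio in UTF-8's favour.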
> >> Great. Only are you sure that BOMs are such a great idea?
>
> It's an immensely stupid idea in a byte-oriented encoding like UTF-8.
> [Though it's pretty dumb in "wide" encodings, too]
I don't understand. In UTF-8, the BOM allows you to easily distinguish
between documents with UTF-8 encoding and a locale-dependent
byte encoding. For multibyte encodings (e.g. UTF-16) it is impossible to
decode the text without knowing the byte order. Do you have some
other solution with a feasible implementation?
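
For what it's worth, BOM sniffing takes only a few lines. The following
is a sketch in Python 3 using the BOM constants from the standard
`codecs` module; the function name `sniff_bom` and its return shape are
made up for this example, and falling back to a locale encoding when no
BOM is found is left to the caller.

```python
import codecs

# Longer BOMs first: the UTF-32-LE BOM begins with the UTF-16-LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

# Usage: strip the BOM and decode, or treat the data as "unknown" otherwise.
raw = codecs.BOM_UTF8 + "日本語".encode("utf-8")
encoding, skip = sniff_bom(raw)
decoded = raw[skip:].decode(encoding) if encoding else None
print(encoding, decoded)
```

Without a BOM the function reports nothing, which is exactly the case
where a UTF-8 file and a locale-dependent byte encoding cannot be told
apart by inspection of the first bytes alone.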
> > I don't really care about how screwed-up Unix Unicode handling is.
>
> What's "screwed up" about it?
Did you read the OP's link?
Cheers,
Brian