[I18n-sig] Autoguessing charset for Unicode strings?

Tue, 19 Jun 2001 19:05:29 -0400

Martin v. Loewis writes:
> Now, many email readers will still choke these days when they see
> UTF-8 (the Microsoft ones being positive exceptions here), but will
> recognize Latin-1. So, another procedure might be
> 
> 1. try to encode as ASCII
> 2. if that fails, try iso-8859-1
> 3. if that fails, use UTF-8
> 
> You'll see that this becomes more and more expensive. People now may
> propose that this really should be application controlled, but I think
> they'd be misguided: the application is normally in no better position
> to select a "good" encoding than the library.
> 
> The latter algorithm may also be considered Euro-centric. It probably
> is.

Yes, it is. ;-) Western-Euro-centric, in fact.

One could hint the character set in (2) based on the domain name of
the sender, e.g., if the sender is from .jp then try ISO-2022-JP
instead of 8859-1.
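
As a rough sketch of that idea (not anything the library does today):
the three-step fallback with step (2) swapped according to the sender's
top-level domain. The names TLD_CHARSET_HINTS and guess_charset are
made up for illustration, the hint table is a tiny hypothetical sample,
and the codec names assume a Python whose standard codecs include
iso-2022-jp, euc-kr and koi8-r.

```python
# Hypothetical hint table: top-level domain -> legacy charset to try in step 2.
# The entries are illustrative assumptions, not a vetted list.
TLD_CHARSET_HINTS = {
    "jp": "iso-2022-jp",
    "kr": "euc-kr",
    "ru": "koi8-r",
}


def guess_charset(text, sender_domain=None):
    """Return (charset, encoded_bytes) for the first charset that fits.

    Order: ASCII, then a domain-hinted legacy charset (ISO-8859-1 when
    there is no hint), then UTF-8.
    """
    tld = sender_domain.rsplit(".", 1)[-1].lower() if sender_domain else ""
    step2 = TLD_CHARSET_HINTS.get(tld, "iso-8859-1")
    for charset in ("us-ascii", step2, "utf-8"):
        try:
            return charset, text.encode(charset)
        except UnicodeEncodeError:
            continue  # this charset cannot represent the text; try the next
    # Only reachable for text UTF-8 itself rejects, e.g. lone surrogates.
    raise ValueError("text cannot be encoded in any candidate charset")


if __name__ == "__main__":
    for sample, domain in [("hello", None), ("caf\u00e9", "example.org"),
                           ("\u65e5\u672c\u8a9e", "example.jp")]:
        print(domain, guess_charset(sample, domain)[0])
```

Each failed step costs a full encoding pass over the text, which is the
growing expense Martin mentions.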

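It would be possible to construct a table mapping ranges of Unicode
codepoints (perhaps even character blocks) to certain legacy encodings
so that the correct one can be chosen quickly, without trial encoding.
Something like this is already needed when transcoding from Unicode to
ISO-2022-CN, where the encoder has to decide, character by character,
which of several underlying character sets to switch to.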
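
A minimal sketch of such a table, under loud assumptions: the ranges
and charset sets below are a small, incomplete illustration I made up
for this example (they are not derived from any real mapping data), and
RANGE_CHARSETS, PREFERENCE, candidates_for and pick_charset are
hypothetical names. Because a block-level table only says a charset
*may* cover a range, the sketch still does one confirming encode at the
end and falls back to UTF-8.

```python
from bisect import bisect_right

# Illustrative, incomplete table, sorted by start:
# (first codepoint, last codepoint, charsets that may cover the range).
RANGE_CHARSETS = [
    (0x0000, 0x007F, {"us-ascii", "iso-8859-1", "koi8-r", "iso-2022-jp"}),
    (0x0080, 0x00FF, {"iso-8859-1"}),
    (0x0400, 0x04FF, {"koi8-r"}),       # Cyrillic block (partial coverage)
    (0x3040, 0x30FF, {"iso-2022-jp"}),  # Hiragana and Katakana
    (0x4E00, 0x9FFF, {"iso-2022-jp"}),  # CJK ideographs (partial coverage)
]
# Order in which surviving candidates are preferred.
PREFERENCE = ["us-ascii", "iso-8859-1", "koi8-r", "iso-2022-jp"]

_STARTS = [first for first, _, _ in RANGE_CHARSETS]


def candidates_for(ch):
    """Look up the charsets listed for the range containing ch, if any."""
    cp = ord(ch)
    i = bisect_right(_STARTS, cp) - 1
    if i >= 0 and cp <= RANGE_CHARSETS[i][1]:
        return RANGE_CHARSETS[i][2]
    return frozenset()


def pick_charset(text):
    """Pick a legacy charset by table lookup, else fall back to UTF-8."""
    remaining = set(PREFERENCE)
    for ch in text:
        remaining &= candidates_for(ch)
        if not remaining:
            return "utf-8"  # no single legacy charset covers every range
    for charset in PREFERENCE:
        if charset in remaining:
            try:
                text.encode(charset)  # confirm: ranges are only approximate
                return charset
            except UnicodeEncodeError:
                pass
    return "utf-8"
```

With this table, mixed text such as Latin-1 plus kanji drops straight to
UTF-8 after one scan, instead of failing two trial encodings first.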

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Sr. Sinostringologist                              http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"