[I18n-sig] Encoding auto-detection

Tom Emerson tree@basistech.com
Fri, 1 Jun 2001 19:43:41 -0400


Martin v. Loewis writes:
> In general, I think encoding auto-detection is a stupid idea, you
> really have to have a higher-level protocol that tells you what the
> encoding is.

This is a utopian idea that completely falls apart in the real world.

It is *very* common for email to be sent making use of both 8-bit and
7-bit encodings with no content-type or content-transfer-encoding.
Without some form of encoding/character set detection you have no idea
what the mail message is encoded with. The fact that the mail RFCs
dictate something is irrelevant.

Similarly you can almost never trust the character encoding specified
for web pages. I have seen a lot of pages that claim to be using
CP1252 or ISO-8859-1 that are actually encoded with Shift-JIS or
EUC-CN or Big5. Indeed, when I was working on the Device Mosaic
browser (the descendant of NCSA Mosaic that was targeted at
embedded devices), if we found a document claiming to be Latin-1 we
ignored the declaration and sniffed the encoding.
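To make the "sniffing" concrete: a minimal sketch of validity-based
detection, assuming the caller supplies the candidate encodings and
their precedence order (real detectors also use byte-frequency
statistics, not just decode success; the names here are hypothetical):

```python
# Hypothetical candidate list; order matters because permissive
# encodings like cp1252 will "successfully" decode almost any bytes,
# so they must come last.
CANDIDATES = ["utf-8", "shift_jis", "euc_jp", "big5", "cp1252"]

def sniff_encoding(data: bytes, candidates=CANDIDATES):
    """Return the first candidate encoding that decodes `data` without
    error, or None if none do."""
    for enc in candidates:
        try:
            data.decode(enc)
            return enc
        except (UnicodeDecodeError, LookupError):
            continue
    return None

# Shift-JIS bytes are rejected by a strict UTF-8 decoder, so the
# sniffer falls through to the right answer:
sniff_encoding("日本語".encode("shift_jis"))   # "shift_jis"
```

This only rules encodings out; when several candidates decode cleanly
(as Latin-1 and CP1252 usually both will), you need frequency
statistics or language models to pick between them.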

It is also common to find pages in Japan, China, and Korea that don't
specify a character set or encoding at all... the authors make
assumptions about the people viewing the pages, which may be false. I
have also seen Japanese pages that contain Shift-JIS *and* EUC-JP
encoded characters in the *same* document.

Higher level protocols cannot be believed.

    -tree

> Trying Unicode-encodings-autodetection might be more
> successful, but I still think it is quite pointless: I predict that
> UTF-16 or UTF-32 will be quite rare, and that most Unicode text will
> be exchanged as UTF-8. 

On Unix. This isn't necessarily true on other platforms.
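For the Unicode case the auto-detection Martin mentions is at least
tractable, since the encoding forms announce themselves with a
byte-order mark. A sketch using the BOM constants from Python's
codecs module:

```python
import codecs

def detect_unicode_bom(data: bytes):
    """Return the Unicode encoding implied by a leading byte-order
    mark, or None when no BOM is present. UTF-32-LE must be checked
    before UTF-16-LE, because its BOM (FF FE 00 00) begins with the
    UTF-16-LE BOM (FF FE)."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF32_LE):
        return "utf-32-le"
    if data.startswith(codecs.BOM_UTF32_BE):
        return "utf-32-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None
```

Of course a BOM is optional, so text with no BOM still needs the
messier statistical sniffing described above.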

    -tree
-- 
Tom Emerson                                          Basis Technology Corp.
Sr. Sinostringologist                              http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"