[I18n-sig] Encoding auto-detection

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Sat, 2 Jun 2001 08:59:35 +0200


> It is *very* common for email to be sent making use of both 8-bit and
> 7-bit encodings with no content-type or content-transfer-encoding.

I think this claim is difficult to support with facts. Of the messages I
receive, most do have MIME headers that declare a charset for their
content.
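
As a rough illustration of what "giving a charset" means in practice, here
is a minimal sketch that reads the declared charset from a message using
Python's standard email package. The sample messages and the helper name
declared_charset are made up for this example.

```python
from email import message_from_string


def declared_charset(raw_message):
    """Return the charset named in the Content-Type header, or None."""
    msg = message_from_string(raw_message)
    # get_content_charset() returns the lowercased charset parameter,
    # or None when no charset is declared.
    return msg.get_content_charset()


# Hypothetical sample messages, one with a MIME header and one without.
with_header = (
    "From: a@example.org\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=iso-8859-1\n"
    "\n"
    "some body text\n"
)
without_header = "From: a@example.org\n\nplain body\n"

print(declared_charset(with_header))
print(declared_charset(without_header))
```

Only in the second case, where nothing is declared, does any guessing
become necessary.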

> Indeed, when I was working on the Device Mosaic browser (the
> descendant of NCSA Mosaic that was targeted for embedded devices)
> if we found a document claiming to be Latin-1 we ignored it and
> sniffed the encoding.

That might be a useful thing to do, but I guess the routine you've
been using was way more complex than what MAL suggested for the
standard library. I doubt you can reliably detect Big 5 by looking at
the first 10 or so bytes of an HTML document.
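
To make the point concrete, the following sketch (not the routine MAL
proposed, just an assumption-laden illustration) checks which of a few
candidate encodings can decode a byte string without error. The candidate
list and the function name plausible_encodings are hypothetical.

```python
# Hypothetical candidate list; a real detector would consider many more.
CANDIDATES = ("ascii", "big5", "latin-1", "utf-8")


def plausible_encodings(data, candidates=CANDIDATES):
    """Return the candidate encodings that decode data without error."""
    result = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        result.append(name)
    return result


# The first 10 bytes of a typical HTML document are plain ASCII markup,
# so they carry no information about the encoding of the text that follows.
html_prefix = b"<html><head><title>"[:10]

# Two Chinese characters encoded as Big5; the same bytes are also
# valid (but meaningless) Latin-1.
big5_text = "\u4e2d\u6587".encode("big5")

print(plausible_encodings(html_prefix))
print(plausible_encodings(big5_text))
```

A successful decode only shows that the bytes are legal in an encoding,
not that the encoding is the right one. Latin-1 accepts every byte
sequence, so it can never be ruled out this way.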

In fact, I'd suggest that HTML encoding detection is yet again
different from general-purpose encoding detection, since you'll have
to take the declared encoding (if any) into account.
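
One possible way to take the declaration into account, sketched here
under the assumption that a declared encoding is tried first and a short
fallback list is consulted only if it fails. The function decode_html and
its fallback order are illustrative choices, not an existing API.

```python
def decode_html(data, declared=None, fallbacks=("utf-8", "latin-1")):
    """Decode data, preferring the declared encoding over guesses.

    Returns a (text, encoding_used) pair.
    """
    order = ([declared] if declared else []) + list(fallbacks)
    for name in order:
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding name: try the next candidate.
            continue
    raise ValueError("no candidate encoding could decode the data")


# The declaration is honoured when it works ...
print(decode_html("\u00e9".encode("utf-8"), declared="utf-8"))
# ... and overridden only when the bytes contradict it.
print(decode_html(b"\xe9", declared="utf-8"))
```

With latin-1 as the last fallback the function always returns something,
which is exactly why its answer cannot be trusted blindly.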

> Higher level protocols cannot be believed.

And neither can autodetection.

Regards,
Martin