[I18n-sig] Encoding auto-detection

Tom Emerson tree@basistech.com
Sat, 2 Jun 2001 13:10:30 -0400


Martin v. Loewis writes:
> > It is *very* common for email to be sent making use of both 8-bit and
> > 7-bit encodings with no content-type or content-transfer-encoding.
> 
> I think this claim is difficult to support by facts. Of the messages I
> receive, most do have a MIME header, giving a charset in their
> content.

I am a computational linguist --- part of the work I've been doing
over the last year is an email corpus, built from messages coming from
a number of mailing lists from over thirteen countries. With over 21K
messages and 60+ MB of text, my experience has been that many of these
messages lack any indication of character set or encoding. I'll write
a script to spin through the headers and determine how many conform to
the standard RFCs, and how many actually include charset information
either in the header or in a MIME body.
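
A minimal sketch of what such a script might look like, assuming the
corpus is available as raw message bytes and using Python's standard
email package. The function names (declares_charset, charset_stats) are
hypothetical, and the check only covers whether a charset parameter is
present in the top-level header or in any MIME body part, not whether
the message otherwise conforms to the RFCs:

```python
from email import message_from_bytes


def declares_charset(raw: bytes) -> bool:
    """Return True if the message, or any MIME part in it, carries a
    charset parameter on its Content-Type header."""
    msg = message_from_bytes(raw)
    # walk() yields the top-level message first, then every subpart.
    return any(part.get_param("charset") is not None for part in msg.walk())


def charset_stats(raw_messages):
    """Count how many messages declare a charset anywhere.

    Returns a (declared, total) tuple."""
    total = declared = 0
    for raw in raw_messages:
        total += 1
        if declares_charset(raw):
            declared += 1
    return declared, total
```

Feeding it every message in a corpus gives the fraction that declare a
charset at all; whether the declared charset is actually correct is a
separate question.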

> That might be a useful thing to do, but I guess the routine you've
> been using was way more complex than what MAL suggested for the
> standard library. I doubt you can reliably detect Big 5 by looking at
> the first 10 or so bytes of an HTML document.

You can't reliably detect much of anything by looking at the first 10
bytes of a document, except in a very constrained domain like the
character set detection that spawned this thread. So we agree.

> > Higher level protocols cannot be believed.
> 
> And neither can autodetection.

That's right... I didn't mean to imply that it could. But the two
together can be quite useful, and if you have enough text,
autodetection can be quite accurate. The problem, of course, is that
most text on the web contains a lot of English as well as other
languages.
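
One way to picture "the two together" is sketched below. This is an
illustration, not a real detector: it treats the declared charset as a
first hint and falls back to trial decoding over a short candidate list.
The function name guess_encoding and the candidate list are assumptions,
and trial decoding is a crude stand-in for the statistical autodetection
discussed in this thread.

```python
def guess_encoding(data: bytes, declared=None,
                   candidates=("ascii", "utf-8", "big5", "latin-1")):
    """Return the first encoding under which data decodes cleanly.

    The declared charset (from a header, say) is tried first but is not
    trusted blindly: if it is unknown or the bytes do not decode under
    it, the candidates are tried in order. Returns None if nothing fits.
    """
    tried = []
    if declared:
        tried.append(declared)
    tried.extend(c for c in candidates if c not in tried)
    for name in tried:
        try:
            data.decode(name)
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name, or bytes invalid in this encoding.
            continue
        return name
    return None
```

Note that pure ASCII input decodes cleanly under every candidate, so the
answer is decided by ordering alone. That is the same ambiguity that
English-heavy text causes for any detector.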

-- 
Tom Emerson                                          Basis Technology Corp.
Sr. Sinostringologist                              http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"