[I18n-sig] Encoding auto-detection

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Sat, 2 Jun 2001 00:12:14 +0200


> I was talking about a general purpose encoding sniffer, the XML
> case would only be a special case. The idea is to pass a magic
> string to the API and then let it fiddle around with it to try
> to deduce the encoding. The magic string might also be a regular
> expression which then has the encoding parameter as group 1, etc.

I see. For a general purpose encoding guesser to be useful, it would
have to work totally differently from the XML autodetection. E.g.
UTF-8 can be detected quite reliably, but you'll have to look at the
entire input.
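
A minimal sketch of such a UTF-8 check, in Python. The function name
looks_like_utf8 is made up for illustration; the approach is simply to
attempt a strict decode of the whole byte string. Note that pure ASCII
input also passes, since ASCII is a subset of UTF-8.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the *entire* byte string is well-formed UTF-8."""
    try:
        # Strict decoding rejects any malformed sequence, wherever it occurs.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

The whole input has to be examined because a single invalid byte
anywhere, even at the very end, is enough to rule out UTF-8.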

In general, I think encoding auto-detection is a stupid idea; you
really have to have a higher-level protocol that tells you what the
encoding is. Auto-detecting among the Unicode encodings might be more
successful, but I still think it is quite pointless: I predict that
UTF-16 or UTF-32 will be quite rare, and that most Unicode text will
be exchanged as UTF-8.
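
For illustration, detection restricted to the Unicode encodings could
be sketched by looking for a byte order mark, assuming the data
carries one at all. The name sniff_bom is hypothetical; the BOM
constants come from Python's standard codecs module.

```python
import codecs

# Check the 4-byte BOMs first: the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

This is only a guess: data without a BOM yields None, and UTF-16-LE
text whose first character after the BOM is U+0000 would be mistaken
for UTF-32-LE.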

In addition, unless you are writing a general-purpose text editor,
there *will* be a higher-level protocol telling you the encoding.

Regards,
Martin