[I18n-sig] Encoding auto-detection

Paul Prescod paulp@ActiveState.com
Fri, 01 Jun 2001 16:07:59 -0700


"Martin v. Loewis" wrote:
> 
>...
> 
> I see. For a general purpose encoding guesser to be useful, it would
> work totally different from the XML autodetection. 

Agreed. They should be treated as two different problems.

>...
> In general, I think encoding auto-detection is a stupid idea, you
> really have to have a higher-level protocol that tells you what the
> encoding is. 

In practice these protocols are very unreliable. I often see data served
from a website as application/octet-stream no matter what its real data
type is.

> ... Trying Unicode-encodings-autodetection might be more
> successful, but I still think it is quite pointless: I predict that
> UTF-16 or UTF-32 will be quite rare, and that most Unicode text will
> be exchanged as UTF-8.

On Windows, if you save a file as "Unicode", it means UTF-16. I think
that UTF-16 is Microsoft's "standard" Unicode encoding. UTF-8 could be
considered Unix's "standard" encoding.

I don't think you should treat a declared encoding and autodetection as
either-or. Autodetection is not as good as really knowing for sure, of
course. That doesn't mean that it is *stupid*. It means it is the best
fallback available when dealing with stupid systems like the Unix file
system or misconfigured web servers.
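
To make the "fallback" idea concrete, here is a rough sketch of what I
have in mind, not a finished design. The function name guess_encoding,
the "declared" parameter and the final latin-1 fallback are all my own
assumptions for illustration. It trusts the higher-level protocol when
there is one, and only then sniffs for a byte order mark and tries
UTF-8:

```python
import codecs

# Byte order marks, checked longest first so that the UTF-32 LE mark
# (FF FE 00 00) is not mistaken for the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data, declared=None):
    """Return a codec name for the bytes in `data` (hypothetical helper).

    `declared` is whatever the higher-level protocol said (HTTP header,
    MIME type, ...). It always wins when present; guessing is only the
    fallback.
    """
    if declared:
        return declared
    # A BOM is the strongest hint the data itself can give us.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # No BOM: text that decodes cleanly as UTF-8 is very likely UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumption: latin-1 as a last resort, since it never fails.
        # It may well be the wrong answer; it is only a guess.
        return "latin-1"
```

With this, data.decode(guess_encoding(data)) handles a Windows "Unicode"
file that starts with a BOM as well as plain UTF-8 from a Unix box,
and a caller that really knows the encoding just passes it in.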

-- 
Take a recipe. Leave a recipe.  
Python Cookbook!  http://www.ActiveState.com/pythoncookbook