[I18n-sig] UTF-8 and BOM

Paul Prescod paulp@ActiveState.com
Wed, 16 May 2001 14:41:35 -0700


"Martin v. Loewis" wrote:
> 
> > Python 2.1's UTF-8 decoder seems to treat the BOM as a real leading
> > character. The UTF-16 decoder removes it. I recognize that the BOM is
> > not useful as a "byte order mark" for UTF-8 data but I would still
> > suggest that the UTF-8 decoder should remove it for these reasons:
> 
> I think it is good to remove the BOM when decoding UTF-8. Most likely,
> the only reason that this is not done is that nobody thought that
> there might be one.

Okay good.
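
To make the behaviour under discussion concrete, here is a minimal sketch.
It assumes a later Python (3.x) than the 2.1 discussed in this thread, and it
uses the "utf-8-sig" codec, which was added in later releases and is not
part of 2.1. The sample bytes are made up for illustration.

```python
import codecs

# Made-up sample: a UTF-8 BOM followed by ASCII text.
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

# The plain "utf-8" codec keeps the BOM as a leading U+FEFF character.
plain = data.decode("utf-8")

# The "utf-8-sig" codec drops a single leading BOM if one is present.
stripped = data.decode("utf-8-sig")

print(repr(plain))
print(repr(stripped))
```

Data without a BOM decodes identically under both codecs, so a reader
can use "utf-8-sig" unconditionally.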

> I disagree that putting the BOM into a file is a good thing - I think
> it is stupid to do so. First of all, auto-detection can always be
> fooled, so there should be a higher-level protocol for reliable data
> processing. 

There should be, but there isn't always. What is the standard way of
tagging UTF-8 documents on the Windows file system?

> UTF-8 is relatively easy to auto-detect if you believe in
> auto-detection - it's just that looking at the first few bytes is not
> sufficient.

Yes, we're going to autodetect by trying to decode the data, but that's a
pretty expensive operation. You never know whether the very first non-ASCII
char will appear in the last few bytes of the file. Anyhow, it doesn't
matter. If I want a BOM in files I write out, I can add it. My main goal
is to have the reader do the right thing with "Microsoft-format" Unicode
files.
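
Roughly, the reader I have in mind could look like the sketch below. It
assumes Python 3. read_text and its latin-1 fallback are hypothetical
choices for illustration, not an existing API. The BOM check is cheap;
the trial decode has to scan the whole buffer in the worst case.

```python
import codecs

def read_text(raw, fallback="latin-1"):
    """Hypothetical helper: decode bytes, preferring UTF-8."""
    # Cheap check: a leading BOM tags the data as UTF-8.
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    # Expensive check: try to decode everything as UTF-8.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback; a real reader would pick this per platform.
        return raw.decode(fallback)
```

For example, read_text(codecs.BOM_UTF8 + b"abc") returns the text without
the leading U+FEFF.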

> OTOH, UTF-8 is concatenation-safe: you can reliably concatenate two
> UTF-8 files to get another UTF-8 file. That property is lost if there
> is a BOM in the file.

So what if there is a BOM in the middle of the data stream? MAL's
decoder will just remove it anyhow. :)
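
A small sketch of the concatenation case, assuming Python 3 and the later
"utf-8-sig" codec. The two byte strings stand in for two BOM-prefixed
files. The final replace() is only an illustration of a decoder that
strips every U+FEFF, not a description of any particular implementation.

```python
import codecs

# Stand-ins for two files that each start with a UTF-8 BOM.
a = codecs.BOM_UTF8 + "first\n".encode("utf-8")
b = codecs.BOM_UTF8 + "second\n".encode("utf-8")

# Only the leading BOM is dropped; the second one survives as U+FEFF.
joined = (a + b).decode("utf-8-sig")

# Stripping every U+FEFF restores the plain concatenation, at the cost
# of also removing any U+FEFF that was meant as a real character.
cleaned = joined.replace("\ufeff", "")

print(repr(joined))
print(repr(cleaned))
```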

-- 
Take a recipe. Leave a recipe.  
Python Cookbook!  http://www.ActiveState.com/pythoncookbook