[I18n-sig] UTF-8 and BOM

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Mon, 21 May 2001 16:44:20 +0200


> > >it is absurd to
> > >expect code dealing with *strings* to handle BOMs.
> > 
> > I agree with that, and is a good reason why the codecs should always
> > remove them.
> 
> ??? This is a good reason why the codec should pass the \ufeff
> through, because a \ufeff in a unicode object should not be 
> considered to be a BOM but a ZWNBSP (it might e.g. be used to
> give hints to a hyphenation or ligature algorithm.)

I agree. The decoder should *never* remove the BOM in the middle of a
string.
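
A minimal sketch of that distinction, assuming a present-day Python 3
interpreter (which postdates this message, so it shows one possible
behaviour rather than what the codecs did at the time). The byte string
is a made-up example:

```python
import codecs

# Little-endian UTF-16: a BOM, "a", a second U+FEFF, then "b".
data = codecs.BOM_UTF16_LE + "a\ufeffb".encode("utf-16-le")

# The generic "utf-16" codec consumes the leading BOM to pick the byte
# order; the later U+FEFF survives as an ordinary character (ZWNBSP).
text = data.decode("utf-16")
print(repr(text))

# "utf-16-le" has a fixed byte order, so even the leading U+FEFF is
# passed through to the string unchanged.
raw = data.decode("utf-16-le")
print(repr(raw))
```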

> Then the write function has an error. A BOM should only be
> written at the start of the file and not on every call to
> write().

I agree. Fixing that should not be too difficult; the codec instance
just needs to change its .encode and .decode attributes after the
first write.
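
One way to sketch that attribute swap, assuming a hypothetical
BomOnceWriter class (it is not part of the codecs module) and a fixed
little-endian byte order to keep the example short:

```python
import codecs
import io

class BomOnceWriter:
    """Hypothetical stream writer: BOM on the first write only."""

    def __init__(self, stream):
        self.stream = stream
        # Start with the BOM-emitting encoder.
        self.encode = self._encode_first

    def _encode_first(self, text):
        # Swap in the BOM-less encoder for all later writes.
        self.encode = self._encode_rest
        return codecs.BOM_UTF16_LE + text.encode("utf-16-le")

    def _encode_rest(self, text):
        return text.encode("utf-16-le")

    def write(self, text):
        self.stream.write(self.encode(text))

buf = io.BytesIO()
writer = BomOnceWriter(buf)
writer.write("a")
writer.write("b")
print(buf.getvalue())
```

The BOM appears once, before "a", and the second write adds only the
two bytes for "b".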

This raises the question of what

f = open("/tmp/foo","w",encoding="utf-16")
f.close()

should give: an empty file, or a file containing just the BOM?
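
A small probe for checking this on a given interpreter, assuming a
Python 3 where open() accepts an encoding argument. The helper name
empty_utf16_file_size is made up for illustration, and a temporary
directory stands in for /tmp/foo. The result is printed rather than
assumed, since it depends on the implementation:

```python
import codecs
import os
import tempfile

def empty_utf16_file_size():
    """Open a UTF-16 text file for writing, close it unwritten, return its size."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "foo")
        f = open(path, "w", encoding="utf-16")
        f.close()
        return os.path.getsize(path)

# 0 means an empty file; 2 means a file holding just the BOM.
print("size after open/close:", empty_utf16_file_size())

# For comparison: the stateless encoder turns an empty string into a bare BOM.
bom_only = "".encode("utf-16") == codecs.BOM_UTF16
print("''.encode('utf-16') is just the BOM:", bom_only)
```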

Regards,
Martin