[I18n-sig] UTF-8 and BOM
Martin v. Loewis
martin@loewis.home.cs.tu-berlin.de
Mon, 21 May 2001 16:44:20 +0200
> > >it is absurd to
> > >expect code dealing with *strings* to handle BOMs.
> >
> > I agree with that, and is a good reason why the codecs should always
> > remove them.
>
> ??? This is a good reason why the codec should pass the \ufeff
> through, because a \ufeff in a unicode object should not be
> considered to be a BOM but a ZWNBSP (it might e.g. be used to
> give hints to a hyphenation or ligature algorithm.)
I agree. The decoder should *never* remove the BOM in the middle of a
string.
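A small sketch of that behaviour, assuming a Python whose built-in utf-16 codec consumes only the leading byte-order mark (this holds for current CPython 3; it is not a claim about the codecs as they stood in 2001). A U+FEFF later in the data should come through as an ordinary character:

```python
import codecs

# Little-endian BOM, "a", a second FF FE pair (U+FEFF in LE), then "b".
raw = (
    codecs.BOM_UTF16_LE
    + "a".encode("utf-16-le")
    + codecs.BOM_UTF16_LE
    + "b".encode("utf-16-le")
)

text = raw.decode("utf-16")

# Only the leading BOM is consumed; the inner U+FEFF survives as ZWNBSP.
print(ascii(text))
```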
> Then the write function has an error. A BOM should only be
> written at the start of the file and not on every call to
> write().
I agree. Fixing that should not be too difficult; the codec instance
just needs to change its .encode and .decode attributes after the
first write.
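A rough sketch of that idea, not the actual codecs implementation: the class name BomOnceWriter is made up for illustration, and only the .encode side is shown since the example only writes. It assumes that the "utf-16" encoder emits a BOM in native byte order on every call, while "utf-16-le" and "utf-16-be" never emit one.

```python
import codecs
import io
import sys


class BomOnceWriter:
    """Hypothetical writer that emits the BOM on the first write only."""

    def __init__(self, stream):
        self.stream = stream
        # The plain "utf-16" encoder prepends a BOM on every call.
        self.encode = codecs.getencoder("utf-16")

    def write(self, text):
        data, _ = self.encode(text)
        self.stream.write(data)
        # After the first write, swap in the BOM-less encoder that
        # matches the byte order the BOM announced (native order).
        native = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
        self.encode = codecs.getencoder(native)


buf = io.BytesIO()
w = BomOnceWriter(buf)
w.write("a")
w.write("b")
```

With this, two write() calls produce the same bytes as encoding the whole string in one go, i.e. a single BOM at the start.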
This raises the question of what
f = open("/tmp/foo","w",encoding="utf-16")
f.close()
should give: an empty file, or a file containing just the BOM?
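One way to see what a given implementation does is simply to measure the file. The following is only a probe, assuming a Python whose open() accepts an encoding argument; it uses a temporary file rather than /tmp/foo and deliberately does not assert which answer is the right one.

```python
import os
import tempfile

# Create a throwaway file path instead of assuming /tmp/foo is writable.
fd, path = tempfile.mkstemp()
os.close(fd)

# Open for writing as UTF-16 and close without writing anything.
f = open(path, "w", encoding="utf-16")
f.close()

# 0 bytes means an empty file; 2 bytes means the file holds just the BOM.
size = os.path.getsize(path)
os.remove(path)
print(size)
```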
Regards,
Martin