[Python-3000] BOM handling

Thu Sep 14 01:14:29 CEST 2006

Antoine Pitrou <solipsis at pitrou.net> wrote:
> 
> 
> On Wednesday 13 September 2006 at 09:41 -0700, Josiah Carlson wrote:
> > And is generally ignored, as per unicode spec; it's a "zero width
> > non-breaking space" - an invisible character with no effect on wrapping
> > or otherwise.
> 
> Well it would be better if Py3K (with all strings unicode) makes things
> easy for the programmer and abstracts away those "invisible characters
> with no textual meaning". Currently it's not the case:

> >>> a = "hello".decode("utf-8")
> >>> b = (codecs.BOM_UTF8 + "hello").decode("utf-8")
> >>> len(a)
> 5
> >>> len(b)
> 6
> >>> a == b
> False

I had this same discussion with someone else previously (though I can't
seem to find it in my archive). One point raised was that Python 2.5 was
apparently supposed to have a variant utf-8 codec that automatically
strips at most one \ufeff character from the beginning of decoded output
and adds one during encoding, similar to how the generic 'utf-16' and
'utf-32' codecs add and strip:

>>> u'hello'.encode('utf-16')
'\xff\xfeh\x00e\x00l\x00l\x00o\x00'
>>> len(u'hello'.encode('utf-16').decode('utf-16'))
5
>>> 

I'm unable to find that particular utf-8 codec in the version of Python
2.5 I have installed, but I may not be looking in the right places, or
spelling it the right way.
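
If the codec in question is the one spelled 'utf-8-sig' (an assumption on
my part; that is the name the standard library uses for a BOM-aware utf-8
codec), the behavior would look roughly like this sketch. It is written in
Python 3 syntax rather than the 2.x used elsewhere in this message:

```python
import codecs

# A utf-8 byte string with a BOM in front, as in the quoted example.
raw = codecs.BOM_UTF8 + "hello".encode("utf-8")

plain = raw.decode("utf-8")         # plain 'utf-8' keeps the leading U+FEFF
stripped = raw.decode("utf-8-sig")  # 'utf-8-sig' strips at most one leading U+FEFF

print(len(plain), len(stripped))

# Encoding with 'utf-8-sig' writes the BOM back out.
encoded = "hello".encode("utf-8-sig")
print(encoded.startswith(codecs.BOM_UTF8))
```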

In any case, I believe that the above behavior is correct for the
context.  Why?  Because utf-8 has no endianness, its single 'generic'
decoding spelling of 'utf-8' is analogous to all three of the 'utf-16',
'utf-16-be', and 'utf-16-le' decoding spellings, two of which don't strip.
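
To make the comparison concrete, here is a small sketch (Python 3 syntax,
which is an assumption; the variable names are mine) that decodes
BOM-prefixed input under each spelling and records the resulting length:

```python
import codecs

text = "hello"

# BOM-prefixed payloads in each byte order.
raw_le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
raw_be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
raw_u8 = codecs.BOM_UTF8 + text.encode("utf-8")

lengths = {
    "utf-16": len(raw_le.decode("utf-16")),        # generic spelling consumes the BOM
    "utf-16-le": len(raw_le.decode("utf-16-le")),  # explicit byte order keeps U+FEFF
    "utf-16-be": len(raw_be.decode("utf-16-be")),  # explicit byte order keeps U+FEFF
    "utf-8": len(raw_u8.decode("utf-8")),          # keeps U+FEFF
}

for name, n in lengths.items():
    print(name, n)
```

Only the generic 'utf-16' spelling comes back with length 5; the other
three keep the extra character.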

> >>> a = "hello".encode("utf-16le").decode("utf-16le")
> >>> b = (codecs.BOM_UTF16_LE + "hello".encode("utf-16le")).decode("utf-16le")
> >>> len(a)
> 5
> >>> len(b)
> 6
> >>> a == b
> False

Georg Brandl responded to this example already.

> >>> a
> u'hello'
> >>> b
> u'\ufeffhello'
> >>> print a
> hello
> >>> print b
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
>   File "/usr/lib/python2.4/encodings/iso8859_15.py", line 18, in encode
>     return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>

There are two answers to this particular "problem".  Either that is
expected and desirable behavior for all non-utf encodings, or all
non-utf encodings need to gain a mapping of the U+FEFF code point to the
empty string.  I think the behavior is expected and desirable.  Why?
Because none of the non-utf encodings has a valid, round-trippable
representation for the U+FEFF code point.

Also, if you want to print possibly arbitrary unicode strings to the
console, consider encoding the unicode string yourself first, passing
either 'ignore' or 'replace' as the second (errors) argument.
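
A minimal sketch of that suggestion, in Python 3 syntax (an assumption)
and using iso8859-15 only because it is the encoding in the traceback
above:

```python
text = "\ufeffhello"

# The default 'strict' handler raises, as in the quoted traceback.
try:
    text.encode("iso8859-15")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'ignore' drops the unencodable character; 'replace' substitutes '?'.
ignored = text.encode("iso8859-15", "ignore")
replaced = text.encode("iso8859-15", "replace")

print(strict_failed, ignored, replaced)
```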

 - Josiah