Wrong default endianness in utf-16 and utf-32 !?

Antoine Pitrou solipsis at pitrou.net
Tue Oct 12 09:47:06 EDT 2010


On Tue, 12 Oct 2010 06:28:23 -0700 (PDT)
jmfauth <wxjmfauth at gmail.com> wrote:

> I hope my understanding is correct and I'm not dreaming.
> 
> When an endianess is not specified, (BE, LE, unmarked forms),
> the Unicode Consortium specifies, the default byte serialization
> should be big-endian.
> 
[...]
> 
> It appears Python is just working in the opposite way.
> 
[...]
> >>> repr(u'abc'.encode('utf-16')[2:]) == repr(u'abc'.encode('utf-16-le'))
> True

When encoding, Python uses the host's endianness by default and writes a
BOM first. So, on a little-endian machine, utf-16 and utf-32 will use
little-endian encoding. When decoding, though, both of these codecs read
the BOM, so there should be no interoperability problems:

>>> '\xff\xfea\x00b\x00c\x00'.decode('utf-16')
u'abc'
>>> '\xfe\xff\x00a\x00b\x00c'.decode('utf-16')
u'abc'


(Do note, though, that the explicit utf*-be and utf*-le variants do not
add a BOM when encoding.)
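
To make the three points above concrete, here is a minimal sketch. It
assumes Python 3 syntax (bytes literals and str.encode), unlike the
Python 2 session above, and the variable names are only illustrative:

```python
import codecs
import sys

text = "abc"

# The generic codec writes a BOM first, then uses the host's byte order.
encoded = text.encode("utf-16")
if sys.byteorder == "little":
    native_bom = codecs.BOM_UTF16_LE
else:
    native_bom = codecs.BOM_UTF16_BE
starts_with_native_bom = encoded.startswith(native_bom)

# The explicit variants fix the byte order and write no BOM.
le = text.encode("utf-16-le")
be = text.encode("utf-16-be")

# Decoding with the generic codec honours whichever BOM is present.
from_le = (codecs.BOM_UTF16_LE + le).decode("utf-16")
from_be = (codecs.BOM_UTF16_BE + be).decode("utf-16")
```

If a big-endian stream with a BOM is wanted regardless of the host, one
option is to prepend codecs.BOM_UTF16_BE to the utf-16-be output by hand,
as the last line does.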

Regards

Antoine.




