Wrong default endianness in utf-16 and utf-32 !?

Tue Oct 12 09:28:23 EDT 2010

I hope my understanding is correct and I'm not dreaming.

When no endianness is specified (the unmarked forms, as opposed to
BE or LE), the Unicode Consortium specifies that the default byte
serialization should be big-endian.

See http://www.unicode.org/faq//utf_bom.html
Q: Which of the UTFs do I need to support?
and
Q: Why do some of the UTFs have a BE or LE in their label,
such as UTF-16LE?

(+ technical papers)
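
To make my reading of the FAQ concrete, here is a minimal sketch of a
decoder that follows the rule as I understand it: honour a BOM if one is
present, otherwise assume big-endian. The function name
decode_utf16_unicode_default is hypothetical (it is not part of Python),
and the sketch assumes no higher-level protocol declares the byte order.

```python
import codecs

def decode_utf16_unicode_default(data):
    """Decode UTF-16 bytes, assuming big-endian when no BOM is present.

    Hypothetical helper illustrating the rule described in the FAQ;
    this is not how Python's own 'utf-16' codec is documented to behave.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        # BOM FE FF: the payload is big-endian
        return data[2:].decode('utf-16-be')
    if data.startswith(codecs.BOM_UTF16_LE):
        # BOM FF FE: the payload is little-endian
        return data[2:].decode('utf-16-le')
    # No BOM: fall back to big-endian, the default I read in the FAQ
    return data.decode('utf-16-be')
```

With this helper, the unmarked bytes b'\x00a\x00b\x00c' decode to u'abc',
and BOM-prefixed data in either byte order decodes the same way.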

Python appears to do the opposite: on my machine the unmarked codec
produces little-endian output.

>>> import sys
>>> sys.version
'2.7 (r27:82525, Jul  4 2010, 09:01:59) [MSC v.1500 32 bit (Intel)]'
>>> u'abc'.encode('utf-16-le')
'a\x00b\x00c\x00'
>>> u'abc'.encode('utf-16-be')
'\x00a\x00b\x00c'
>>> u'abc'.encode('utf-16')
'\xff\xfea\x00b\x00c\x00'
>>> u'abc'.encode('utf-16')[2:] == u'abc'.encode('utf-16-be')
False
>>> u'abc'.encode('utf-16')[2:] == u'abc'.encode('utf-16-le')
True

The same happens with utf-32, and with utf-16/utf-32 in Python 3.1.2.
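
The same check can be written so that it does not depend on my machine.
This is only a sketch: describe_utf16 is a name I made up, and it
assumes the unmarked 'utf-16' codec always writes a 2-byte BOM first,
which is what I observe here. It compares that BOM with the constants in
the codecs module and with sys.byteorder.

```python
import codecs
import sys

def describe_utf16(text=u'abc'):
    """Return (bom, payload, order) for text encoded with plain 'utf-16'.

    Hypothetical helper; 'order' is 'little', 'big' or 'unknown'.
    """
    encoded = text.encode('utf-16')
    # Assumption: the first two bytes are the BOM, the rest is the payload
    bom, payload = encoded[:2], encoded[2:]
    if bom == codecs.BOM_UTF16_LE:
        order = 'little'
    elif bom == codecs.BOM_UTF16_BE:
        order = 'big'
    else:
        order = 'unknown'
    return bom, payload, order

bom, payload, order = describe_utf16()
# Does the codec follow the machine's native byte order?
print(order == sys.byteorder)
# Does the payload match the explicitly labelled big-endian codec?
print(payload == u'abc'.encode('utf-16-be'))
```

If the first line prints True, the byte order of the unmarked codec
follows the platform rather than a fixed big-endian default. The BOM
written in front still tells a reader which order was used.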

I tried to find a precise discussion of this subject,
without success.

Any thoughts?