different encodings for unicode() and u''.encode(), bug?

"Martin v. Löwis" martin at v.loewis.de
Wed Jan 2 15:48:30 EST 2008


> Do not know what the implications of encoding according to "ANSI
> codepage (CP_ACP)" are. Windows only seems clear, but why does it only
> complain when decoding a non-empty string (or when encoding the empty
> unicode string) ?

It has no implications for the issue here. CP_ACP is a Microsoft
invention: an alias rather than a specific encoding. The "ANSI code
page" (as Microsoft calls it) is not an encoding for which I could
specify a mapping from bytes to characters, but a system-global
indirection based on a language default. For example, in the
Western European/U.S. version of Windows, the default for CP_ACP
is cp1252 (a local installation may change that default,
system-wide).
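
To illustrate why the "ANSI code page" cannot be given a single
byte-to-character mapping, here is a small sketch that decodes the same
byte under two code pages that CP_ACP may resolve to. The choice of
cp1251 (the Cyrillic code page) as the second example and the variable
names are assumptions made for illustration; only cp1252 is mentioned
above.

```python
# The same byte means different characters depending on which code page
# the system-wide CP_ACP setting happens to point at.
raw = b"\xe9"

western = raw.decode("cp1252")   # U+00E9, LATIN SMALL LETTER E WITH ACUTE
cyrillic = raw.decode("cp1251")  # U+0439, CYRILLIC SMALL LETTER SHORT I

print(repr(western))
print(repr(cyrillic))
print(western == cyrillic)       # False: one byte, two different characters
```

The explicit codec names give the same result on every machine, which
is what makes them encodings in the strict sense; CP_ACP merely selects
one of them per system.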

The likely cause is the one Piet also guessed: if the input is an
empty string, no attempt is made to actually run the codec; the
output is simply assumed to be an empty string again. This is
correct behavior for all codecs that Python supports in its default
installation, at least for the direction bytes->unicode. For the
reverse direction, such an optimization would be incorrect; consider
u"".encode("utf-16"), which yields a byte order mark rather than an
empty string.
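
The asymmetry between the two directions can be checked with a short
sketch. The selection of codecs below is only a sample chosen for
illustration, not the full set shipped with Python, and the variable
names are hypothetical.

```python
# Decoding empty input: every codec in this sample returns an empty
# unicode string, so short-circuiting this direction is harmless.
sample_codecs = ("ascii", "latin-1", "utf-8", "utf-16")
decoded = [b"".decode(name) for name in sample_codecs]
print(decoded)

# Encoding empty input: utf-16 still emits a byte order mark, so the
# result is not empty and the same shortcut would give a wrong answer.
encoded = u"".encode("utf-16")
print(repr(encoded))
print(len(encoded))  # 2: the BOM, in the platform's native byte order
```

Which of the two possible byte orders the mark appears in depends on
the machine, so the sketch checks only its length.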

HTH,
Martin


