[Python-Dev] unicode/string asymmetries

Martin v. Loewis martin@v.loewis.de
Fri, 11 Jan 2002 06:47:14 +0100


> > > windows "ansi" is an alias for the encoding you get from
[...]
> > Isn't that also known as "mbcs" in Python? And it is different from
> > "oem", which is not exposed to Python, right?
[...]
> mbcs is an "encoding", but a strange encoding in that it depends on the
> character set.  The character set itself determines what bytes are lead
> bytes.

That is my understanding also.
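
To illustrate the lead-byte point, here is a small sketch. It assumes a current Python 3 interpreter that ships the standard cp1252 and cp932 codecs, and the variable names are only for illustration:

```python
# The same two bytes, decoded under two different Windows code pages.
data = b"\x83\x41"

# cp1252 is a single-byte code page: each byte is a character by itself.
as_cp1252 = data.decode("cp1252")   # U+0192 followed by "A": two characters

# cp932 (Japanese) is multi-byte: 0x83 is a lead byte, so it combines
# with the following byte into a single character.
as_cp932 = data.decode("cp932")     # U+30A2 (KATAKANA LETTER A): one character

# Print code points rather than the characters themselves, so the output
# does not depend on the console encoding.
print([hex(ord(c)) for c in as_cp1252])
print([hex(ord(c)) for c in as_cp932])
```

Whether 0x83 is a lead byte is a property of the code page, not of the byte string.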

> Thus, the same mbcs string may be interpreted differently depending on the
> current character set/code page.  Thus "ansi" and "oem" are code pages,
> where mbcs is an encoding.

That is not really true, is it? "ansi" and "oem" are not code pages,
are they? At least, not constant code pages, but code pages that depend
on the national version, right?

"mbcs" uses MultiByteToWideChar with CP_ACP, so "mbcs" *is* CP_ACP,
where ACP stands for "ANSI Code Page", right? CP_ACP is the code page
that the "ANSI" functions, i.e. the *A functions, expect. It might be
code page 1252, or it might be something else.
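
A rough sketch of what "it might be something else" means in practice. The helper name ansi_code_page is hypothetical. It calls the Win32 function GetACP through ctypes and returns None on other platforms. The three code pages are just examples of what CP_ACP could resolve to on different national versions:

```python
import sys


def ansi_code_page():
    """Return the active ANSI code page number on Windows, else None.

    Hypothetical helper for illustration; it only wraps Win32 GetACP().
    """
    if sys.platform != "win32":
        return None
    import ctypes
    return ctypes.windll.kernel32.GetACP()


# One byte, as an *A function might receive it.
raw = b"\xe9"

# Possible values of CP_ACP on Western European, Cyrillic and Greek systems.
candidates = ["cp1252", "cp1251", "cp1253"]
decoded = {name: raw.decode(name) for name in candidates}

for name in candidates:
    # ascii() keeps the output independent of the console encoding.
    print(name, ascii(decoded[name]))

print("active ANSI code page:", ansi_code_page())
```

On Windows, decoding with "mbcs" should give whichever of these results matches the active ANSI code page.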

Likewise, the OEM code page is not a fixed thing, either. Instead, it
is what DOS would have used in this locale. So, CP_OEMCP might be 437,
or it might be something else, again, e.g. 850.
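
The same can be sketched for the OEM side. Again, oem_code_page is a hypothetical helper around the Win32 function GetOEMCP, and 437 and 850 are only two of the values CP_OEMCP might take:

```python
import sys


def oem_code_page():
    """Return the active OEM code page number on Windows, else None.

    Hypothetical helper for illustration; it only wraps Win32 GetOEMCP().
    """
    if sys.platform != "win32":
        return None
    import ctypes
    return ctypes.windll.kernel32.GetOEMCP()


# The same byte means different things under two possible OEM code pages.
raw = b"\x9b"
in_437 = raw.decode("cp437")   # U+00A2 CENT SIGN
in_850 = raw.decode("cp850")   # U+00F8 LATIN SMALL LETTER O WITH STROKE

# The same character also gets different bytes under an ANSI code page
# and an OEM code page.
e_acute = "\u00e9"
ansi_bytes = e_acute.encode("cp1252")   # b'\xe9'
oem_bytes = e_acute.encode("cp437")     # b'\x82'

print(ascii(in_437), ascii(in_850))
print(ansi_bytes, oem_bytes)
print("active OEM code page:", oem_code_page())
```

So text written by a console program and read back through an *A function may not round-trip unless the two code pages happen to agree on the characters involved.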

I think it might have been less confusing to call the "mbcs" encoding
"ansi", and to expose the "oem" encoding (which can still be done).

Please correct me if I'm wrong.

Regards,
Martin