[Python-Dev] (Not) delaying the 3.2 release

Martin (gzlist) gzlist at googlemail.com
Thu Sep 16 21:43:25 CEST 2010


On 16/09/2010, Guido van Rossum <guido at python.org> wrote:
> On Thu, Sep 16, 2010 at 11:16 AM, Toshio Kuratomi <a.badger at gmail.com> wrote:
>> You were talking about encodings that were supersets of 7-bit ASCII.
>> I think Martin was demonstrating a byte string that was a superset of
>> 7-bit ASCII being fed to a stdlib function which went wrong.
>
> Whoops, sorry. I don't have access to Windows so I can't reproduce
> this though. I also don't understand it. What is the Unicode codepoint
> for that 十 character? What is sys.getfilesystemencoding()? What is the
> value of "C:\\十".encode(sys.getfilesystemencoding())?

My fault, I should have been clearer. I was trying to demonstrate the
difference between unix-friendly encodings such as UTF-8 and the EUC
codecs, which use only high-bit bytes for non-ASCII text, and encodings
such as the ISO-2022 codecs and Shift JIS, whose multibyte sequences can
contain bytes in the ASCII range.

In the example I gave, 十 encodes in CP932 as '\x8f\\', and the
function gets confused by the second byte, which is an ASCII backslash.
Obviously the right answer there is just to use Unicode, rather than to
write a function that copes with weird multibyte codecs.
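
For anyone without a Windows box to hand, the byte-level difference can
be seen on any platform with something like the following. This is only
a rough sketch in Python 3 syntax: ascii_range_bytes and naive_split are
made-up helpers for illustration, and naive_split merely stands in for
byte-oriented path code in general, not for any particular stdlib
function.

```python
# Sketch only: compare how U+5341 (十) encodes under a few codecs.
ch = "\u5341"

def ascii_range_bytes(data):
    """Return the bytes in data that fall in the 7-bit ASCII range."""
    return bytes(b for b in data if b < 0x80)

# UTF-8 and EUC-JP should report no ASCII-range bytes for this character;
# CP932 and ISO-2022-JP should report some.
for codec in ("utf-8", "euc_jp", "cp932", "iso2022_jp"):
    encoded = ch.encode(codec)
    print(codec, encoded, ascii_range_bytes(encoded))

# Hypothetical naive splitter, standing in for byte-oriented path code.
def naive_split(path_bytes):
    return path_bytes.split(b"\\")

# The CP932 trail byte 0x5c is indistinguishable from a path separator
# once the text has been reduced to bytes.
path = ("C:\\" + ch).encode("cp932")
print(naive_split(path))
```

Splitting the decoded string instead of the bytes avoids the problem,
since the backslash test is then applied to whole characters.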

Martin

