[Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0?

Marcin 'Qrczak' Kowalczyk qrczak at knm.org.pl
Tue Sep 30 19:37:36 CEST 2008


2008/9/30 James Y Knight <foom at fuhm.net>:

> >>> u'\udc90\udc90'.encode('utf-8')
> '\xed\xb2\x90\xed\xb2\x90'

This is wrong: UTF-8 (like the other UTFs) encodes Unicode scalar values,
not arbitrary Unicode code points, so surrogate code points as such
cannot be encoded. '\xed\xb2\x90' is invalid UTF-8.
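
To make the point concrete, here is a minimal sketch in Python 3 syntax
(not the Python 2.6 session quoted above). It assumes the strict 'utf-8'
codec and the 'surrogatepass' error handler of Python 3.1 and later; the
variable names are only illustrative.

```python
# A string made of two unpaired low surrogates, as in the quoted example.
s = "\udc90\udc90"

# The strict UTF-8 codec refuses to encode surrogate code points.
try:
    s.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# 'surrogatepass' deliberately bypasses that check and yields the
# byte sequence shown in the quoted session.
raw = s.encode("utf-8", "surrogatepass")

# A strict decoder rejects those bytes, i.e. treats them as invalid UTF-8.
try:
    raw.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

print(strict_encode_ok, raw, strict_decode_ok)
```

The bytes can be produced only by explicitly opting out of validation,
which matches the claim that they are not well-formed UTF-8.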

I've experimentally implemented (not for Python) a different escaping
scheme with a goal similar to that of UTF-8b: undecodable bytes are
prefixed with U+0000 instead of being converted to unpaired surrogates,
and '\x00' itself decodes as U+0000 U+0000, so that the escape prefix
stays unambiguous.
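
A rough Python sketch of such a scheme, for illustration only. The
description above does not say what follows the U+0000 prefix, so this
sketch assumes an undecodable byte 0xNN becomes U+0000 followed by
U+00NN. The function names decode_nul_escaped and encode_nul_escaped
are hypothetical, and this is not the implementation mentioned above.

```python
def decode_nul_escaped(data: bytes) -> str:
    """Decode UTF-8, escaping undecodable bytes with a U+0000 prefix."""
    out = []
    pos = 0
    while pos < len(data):
        try:
            good = data[pos:].decode("utf-8")
            bad = None
            pos = len(data)
        except UnicodeDecodeError as err:
            # err.start is the offset of the first bad byte in the slice.
            good = data[pos:pos + err.start].decode("utf-8")
            bad = data[pos + err.start]
            pos += err.start + 1
        # A real NUL byte is doubled so it cannot be mistaken for a prefix.
        out.append(good.replace("\x00", "\x00\x00"))
        if bad is not None:
            # Assumption: the escaped byte 0xNN is written as U+0000 U+00NN.
            out.append("\x00" + chr(bad))
    return "".join(out)


def encode_nul_escaped(text: str) -> bytes:
    """Inverse of decode_nul_escaped for strings it produced."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\x00":
            out += ch.encode("utf-8")
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("dangling U+0000 escape prefix")
        nxt = ord(text[i + 1])
        if nxt == 0:
            out.append(0)        # U+0000 U+0000 -> byte 0x00
        elif 0x80 <= nxt <= 0xFF:
            out.append(nxt)      # U+0000 U+00NN -> raw byte 0xNN
        else:
            raise ValueError("invalid character after U+0000 prefix")
        i += 2
    return bytes(out)


name = b"caf\xe9\x00.txt"        # Latin-1 byte plus a NUL, not valid UTF-8
escaped = decode_nul_escaped(name)
assert encode_nul_escaped(escaped) == name
```

Unlike the surrogate approach, every character in the escaped string is
a Unicode scalar value, so the result can itself be stored as valid
UTF-8.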

GLib provides some functions to convert filenames for display, in a
way which is not necessarily reversible (the result may include some
hex escapes written in ASCII).
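
As an illustration of why a display-only conversion of this kind cannot
be reversed, here is a small Python sketch. It does not reproduce GLib's
actual algorithm; it uses Python's 'backslashreplace' decoding error
handler (Python 3.5 and later), and display_name is a hypothetical
helper.

```python
def display_name(raw: bytes) -> str:
    """Turn a filename into printable text; invalid bytes become \\xNN."""
    return raw.decode("utf-8", errors="backslashreplace")


# One name holds a raw 0xE9 byte, the other literally spells out "\xe9".
a = display_name(b"caf\xe9.txt")
b = display_name(b"caf\\xe9.txt")

# Both produce the same display string, so the original bytes are lost.
print(a == b)
```

Such output is fine for showing a name to the user, but it cannot be
passed back to the filesystem to reopen the file.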

-- 
Marcin Kowalczyk
qrczak at knm.org.pl
http://qrnik.knm.org.pl/~qrczak/