[Python-3000] Pre-PEP: Easy Text File Decoding

Sat Oct 14 19:03:21 CEST 2006

Marcin 'Qrczak' Kowalczyk wrote:
> I've implemented a hack which allows simple programs to "just work" in
> case of UTF-8. It's a modified encoder/decoder which escapes malformed
> UTF-8 sequences with '\0' bytes, and thus allows arbitrary byte
> sequences to round-trip UTF-8 decoding and encoding. It's not used by
> default and it's never used when "UTF-8" is specified explicitly,
> because it's not the true UTF-8, but I have an environment variable
> which says "if the locale is UTF-8, use the modified UTF-8 as the
> default encoding".

Actually, I think there is a "better" (i.e. more Unicode-like) way:
use the private-use area. For "wide" Unicode, choose some "high"
characters, e.g. from plane 16 (say, U+1020xx). For "narrow"
Unicode, choose some from the "middle" (say, U+F4xx). There is
a slight chance of ambiguity here if the actual input also
contains such PUA characters; if you worry about this, you could
escape those.
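
To make the idea concrete, here is a rough sketch of the "narrow"
variant in present-day Python. It assumes that xx in U+F4xx is simply
the value of the offending byte; the handler name "pua-escape" and the
helpers decode_lenient/encode_lenient are made up for illustration and
are not part of any existing codec.

```python
import codecs

# Assumption: a malformed byte 0xNN is escaped as U+F4NN (BMP private-use area).
PUA_BASE = 0xF400


def _pua_escape(exc):
    """Decode error handler: map each undecodable byte to U+F4xx."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Return the replacement text and the position at which to resume decoding.
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


# "pua-escape" is a hypothetical handler name chosen for this sketch.
codecs.register_error("pua-escape", _pua_escape)


def decode_lenient(data: bytes) -> str:
    """Decode UTF-8, escaping malformed bytes into the private-use area."""
    return data.decode("utf-8", errors="pua-escape")


def encode_lenient(text: str) -> bytes:
    """Inverse of decode_lenient: turn U+F480..U+F4FF back into raw bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if PUA_BASE + 0x80 <= cp <= PUA_BASE + 0xFF:
            out.append(cp - PUA_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)
```

Note that this sketch does not escape PUA characters that are genuinely
present in the input, so a well-formed U+F4xx in the input would come
back as a single raw byte. That is exactly the ambiguity mentioned
above.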

For Py3k, I would like to propose a standard "binary" codec,
which is an ASCII superset and decodes bytes 00..7F to ASCII,
and bytes 80..FF to U+EFxx. This would allow arbitrary bytes to
round-trip through text.
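
A minimal sketch of what such a codec could do, written as two plain
functions rather than a registered codec. It assumes byte 0xNN
(80..FF) maps to U+EFNN, which is my reading of "U+EFxx"; the names
binary_decode and binary_encode are illustrative only.

```python
def binary_decode(data: bytes) -> str:
    """Bytes 00..7F decode to ASCII; bytes 80..FF decode to U+EF80..U+EFFF."""
    return "".join(chr(b) if b < 0x80 else chr(0xEF00 + b) for b in data)


def binary_encode(text: str) -> bytes:
    """Inverse mapping; any other character cannot be represented."""
    out = bytearray()
    for index, ch in enumerate(text):
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)
        elif 0xEF80 <= cp <= 0xEFFF:
            out.append(cp - 0xEF00)
        else:
            raise ValueError(
                f"character U+{cp:04X} at index {index} is not encodable"
            )
    return bytes(out)
```

With this mapping every byte string decodes to exactly one text string
and encodes back to the same bytes, and pure ASCII input is unchanged.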

Regards,
Martin