[Python-3000] Pre-PEP: Easy Text File Decoding

Marcin 'Qrczak' Kowalczyk qrczak at knm.org.pl
Sat Oct 14 20:45:54 CEST 2006


"Martin v. Löwis" <martin at v.loewis.de> writes:

> Marcin 'Qrczak' Kowalczyk schrieb:
>> I've implemented a hack which allows simple programs to "just work" in
>> case of UTF-8. It's a modified encoder/decoder which escapes malformed
>> UTF-8 sequences with '\0' bytes, and thus allows arbitrary byte
>> sequences to round-trip UTF-8 decoding and encoding. It's not used by
>> default and it's never used when "UTF-8" is specified explicitly,
>> because it's not the true UTF-8, but I have an environment variable
>> which says "if the locale is UTF-8, use the modified UTF-8 as the
>> default encoding".
>
> Actually, I think there is a "better" (i.e. more Unicode-like) way:
> use the private-use area.

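Using the private-use area changes the interpretation of some filenames
which are valid UTF-8 (or generally of texts known not to contain '\0'),
because private-use code points can legitimately occur in them. My hack
is a pure extension for such texts, since decoding them with standard
UTF-8 never produces U+0000.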
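
A minimal sketch of the escaping idea in Python, not the actual
implementation. It assumes one particular escape format that the
description above does not spell out: each byte that is not part of a
well-formed UTF-8 sequence decodes to U+0000 followed by the character
whose code point equals the byte value. The names nul_escape,
decode_modified and encode_modified are hypothetical, and the input is
assumed to contain no 0x00 bytes (as with filenames).

```python
import codecs


def _nul_escape(err):
    # Assumed escape format: every undecodable byte b becomes
    # U+0000 followed by chr(b). Only decoding errors are handled.
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join("\0" + chr(b) for b in bad), err.end


codecs.register_error("nul_escape", _nul_escape)


def decode_modified(data: bytes) -> str:
    # Well-formed UTF-8 decodes as usual; malformed bytes are escaped.
    return data.decode("utf-8", errors="nul_escape")


def encode_modified(text: str) -> bytes:
    # Reverse the escaping: U+0000 followed by a character stands for
    # one raw byte; everything else is encoded as ordinary UTF-8.
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "\0" and i + 1 < len(text):
            out.append(ord(text[i + 1]))
            i += 2
        else:
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)
```

Under these assumptions, text that is valid UTF-8 and free of '\0'
decodes exactly as with the standard codec, and any other '\0'-free
byte string survives encode_modified(decode_modified(data)) unchanged.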

> For Py3k, I would like to propose a standard "binary" codec,
> which is an ASCII superset and decodes bytes 00..7F to ASCII,
> and bytes 80..FF to U+EFxx. This would allow bytes to round-trip
> through text.

It's simpler to use the existing ISO-8859-1 encoding.
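
For comparison, a small Python sketch of both approaches. The functions
decode_binary and encode_binary are hypothetical and only illustrate the
proposed mapping, under the assumption that "U+EFxx" means byte 0xNN
maps to U+EFNN. The ISO-8859-1 part uses the existing codec, which maps
every byte value to the code point of the same number.

```python
def decode_binary(data: bytes) -> str:
    # Assumed reading of the proposal: 00..7F -> ASCII,
    # 80..FF -> U+EF80..U+EFFF (private-use area).
    return "".join(chr(b) if b < 0x80 else chr(0xEF00 + b) for b in data)


def encode_binary(text: str) -> bytes:
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)
        elif 0xEF80 <= cp <= 0xEFFF:
            out.append(cp - 0xEF00)
        else:
            raise ValueError(f"cannot encode U+{cp:04X}")
    return bytes(out)


# Every possible byte value, to check both round trips.
data = bytes(range(256))

# The proposed mapping round-trips arbitrary bytes.
assert encode_binary(decode_binary(data)) == data

# So does the existing ISO-8859-1 codec, with no new code needed.
assert data.decode("iso-8859-1").encode("iso-8859-1") == data
```

Both round-trip all 256 byte values. The difference is only in which
characters the bytes 80..FF appear as in the decoded text.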

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

