[Python-3000] Pre-PEP: Easy Text File Decoding

Marcin 'Qrczak' Kowalczyk qrczak at knm.org.pl
Mon Oct 16 21:03:31 CEST 2006


"Martin v. Löwis" <martin at v.loewis.de> writes:

> Marcin 'Qrczak' Kowalczyk schrieb:
>> It is true that it can change the interpretation of file contents.
>> This is unavoidable, unless someone uses unpaired surrogates for this
>> purpose (or code points above U+10FFFF). I've seen such proposals,
>> but IMHO they bend the rules too far.
>
> It's not exactly unavoidable: any escaping mechanism can support the
> full range of valid input. In your escaping mechanism, you could
> duplicate 0 bytes on decoding, and write a null byte if you have two
> subsequent NUL characters on encoding.

This is exactly what I am doing. The encoding is able to decode
arbitrary byte sequences, including '\0' bytes, and encodes them back
losslessly.

The point is that it differs from true UTF-8 for strings which contain
'\0' or U+0000. It's unavoidable that it differs from UTF-8 for some
strings, unless code points not encodable in UTF-8 are used.

It doesn't differ from true UTF-8 when there is no '\0' or U+0000.
For such strings the escaping fires only where a UTF-8 decoder would
have reported an error. In other words, it changes only the behavior
of code that would otherwise fail; it doesn't break anything that
works in UTF-8.

My encoder is injective: it accepts U+0000 prefixes only in sequences
which would have been invalid UTF-8.
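
A minimal Python sketch of the scheme described above may make this
concrete. It is an illustration under assumptions, not the actual
implementation: the function names are hypothetical, an undecodable
byte b is assumed to decode to U+0000 followed by the character with
code point b, and a '\0' byte is assumed to decode to two U+0000
characters. Injectivity is enforced in the simplest way available, by
rejecting any string that the decoder could not have produced.

```python
def decode_nul_escaped(data: bytes) -> str:
    """Decode arbitrary bytes; never fails (hypothetical helper)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == 0:
            out.append("\0\0")            # a NUL byte is doubled
            i += 1
        elif b < 0x80:
            out.append(chr(b))            # plain ASCII
            i += 1
        else:
            # Try the well-formed multi-byte UTF-8 lengths in turn.
            for length in (2, 3, 4):
                try:
                    out.append(data[i:i + length].decode("utf-8"))
                except UnicodeDecodeError:
                    continue
                i += length
                break
            else:
                # Not valid UTF-8 here: escape this one byte (assumed form).
                out.append("\0" + chr(b))
                i += 1
    return "".join(out)


def encode_nul_escaped(text: str) -> bytes:
    """Inverse of decode_nul_escaped; rejects strings outside its image."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\0":
            out += ch.encode("utf-8")
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("dangling U+0000 escape")
        code = ord(text[i + 1])
        if code == 0:
            out.append(0)                 # doubled U+0000 is one NUL byte
        elif 0x80 <= code <= 0xFF:
            out.append(code)              # escaped raw byte
        else:
            raise ValueError("invalid escape after U+0000")
        i += 2
    result = bytes(out)
    # Accept an escape only where the bytes really are invalid UTF-8;
    # otherwise two different strings would encode to the same bytes.
    if decode_nul_escaped(result) != text:
        raise ValueError("escape used where plain UTF-8 would do")
    return result


raw = b"caf\xc3\xa9\x00\xff\xc3"          # valid UTF-8, a NUL, two stray bytes
text = decode_nul_escaped(raw)
assert encode_nul_escaped(text) == raw    # lossless round trip
```

With this sketch, input that is valid UTF-8 and contains no '\0'
decodes exactly as a strict UTF-8 decoder would decode it, and a
string such as U+0000 U+00C3 U+0000 U+00A9 is rejected by the encoder
because the bytes C3 A9 are already the valid UTF-8 encoding of U+00E9.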

I agree that it's not suitable for showing the filename to a user.

> I still think that PUA characters would be a better use

What if the filename contains the correct UTF-8 encoding of such a PUA
character?
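
The collision behind this question can be shown with a small Python
sketch. The scheme here is hypothetical: mapping each undecodable byte
b to the private-use code point U+E000 + b is an assumption made only
for illustration, since no concrete PUA mapping has been specified in
this thread, and the function name is invented as well.

```python
PUA_BASE = 0xE000  # hypothetical start of the escape range


def decode_pua_escaped(data: bytes) -> str:
    """Decode bytes, mapping undecodable bytes into the PUA (illustrative)."""
    out = []
    i = 0
    while i < len(data):
        # Try the well-formed UTF-8 sequence lengths in turn.
        for length in (1, 2, 3, 4):
            try:
                out.append(data[i:i + length].decode("utf-8"))
            except UnicodeDecodeError:
                continue
            i += length
            break
        else:
            # Undecodable byte: represent it as a PUA character.
            out.append(chr(PUA_BASE + data[i]))
            i += 1
    return "".join(out)


stray_byte = b"\xff"                       # not valid UTF-8
genuine_pua = "\ue0ff".encode("utf-8")     # valid UTF-8 for U+E0FF

# Two different filenames decode to the same string, so an encoder
# cannot tell which bytes to write back.
assert stray_byte != genuine_pua
assert decode_pua_escaped(stray_byte) == decode_pua_escaped(genuine_pua)
```

A PUA scheme would need an extra rule for valid input that already
contains characters from the escape range; without one the decoding is
not reversible for such filenames.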

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

