[Python-3000] Pre-PEP: Easy Text File Decoding
Marcin 'Qrczak' Kowalczyk
qrczak at knm.org.pl
Mon Oct 16 21:03:31 CEST 2006
"Martin v. Löwis" <martin at v.loewis.de> writes:
> Marcin 'Qrczak' Kowalczyk schrieb:
>> It is true that it can change the interpretation of file contents.
>> This is unavoidable. Unless someone uses unpaired surrogates for this
>> purpose (or code points above U+10FFFF) - I've seen such proposals,
>> but IMHO they are abusing rules too far.
>
> It's not exactly unavoidable: any escaping mechanism can support the
> full range of valid input. In your escaping mechanism, you could
> duplicate 0 bytes on decoding, and write a null byte if you have two
> subsequent NUL characters on encoding.

This is exactly what I am doing. The encoding can decode arbitrary
byte sequences, including '\0' bytes, and encode them back
losslessly.

The point is that it differs from true UTF-8 for strings that contain
'\0' or U+0000. It's unavoidable that it differs from UTF-8 for some
strings, unless code points not encodable in UTF-8 are used.

It doesn't differ from true UTF-8 when there is no '\0' or U+0000.

Because it doesn't differ from UTF-8 for those strings, the escaping
fires only where a UTF-8 decoder would have reported an error. In other
words, it only changes the behavior of code that would otherwise fail;
it doesn't break anything that would work in UTF-8.

My encoder is injective: it accepts U+0000 prefixes only in sequences
that would have been invalid UTF-8.
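
A minimal Python sketch of such an escaping codec, for illustration
only. The message does not say which character follows the U+0000
prefix for an invalid byte; the sketch assumes it is the code point
equal to the byte value (U+0080 to U+00FF). The names decode_escaped
and encode_escaped are hypothetical, and the round-trip check in the
encoder is just one simple way to get the injectivity described above,
not necessarily how the real implementation does it.

```python
def decode_escaped(data: bytes) -> str:
    """Decode arbitrary bytes; never fails."""
    out = []
    i = 0
    while i < len(data):
        if data[i] == 0:
            # A NUL byte is doubled so that a single U+0000 stays free
            # to act as the escape prefix.
            out.append("\0\0")
            i += 1
            continue
        # Try to read one well-formed UTF-8 sequence (1 to 4 bytes).
        for length in (1, 2, 3, 4):
            try:
                ch = data[i:i + length].decode("utf-8")
            except UnicodeDecodeError:
                continue
            out.append(ch)
            i += length
            break
        else:
            # Assumption: an invalid byte becomes U+0000 followed by
            # the code point with the same value as the byte.
            out.append("\0" + chr(data[i]))
            i += 1
    return "".join(out)


def encode_escaped(text: str) -> bytes:
    """Inverse of decode_escaped; rejects strings it could not have produced."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\0":
            out += ch.encode("utf-8")
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("dangling U+0000 escape prefix")
        code = ord(text[i + 1])
        if code == 0:
            out.append(0)          # doubled U+0000 -> one NUL byte
        elif 0x80 <= code <= 0xFF:
            out.append(code)       # escaped raw byte
        else:
            raise ValueError("invalid character after U+0000")
        i += 2
    data = bytes(out)
    # Injectivity: refuse escapes whose bytes would read back as valid
    # UTF-8, since another string already encodes to those bytes.
    if decode_escaped(data) != text:
        raise ValueError("escaped bytes would form valid UTF-8")
    return data


raw = b"caf\xc3\xa9\x00\xff\xc3"   # valid UTF-8, a NUL, two stray bytes
assert encode_escaped(decode_escaped(raw)) == raw
```

For input with no NUL bytes and no invalid sequences, decode_escaped
returns the same string as a strict UTF-8 decoder would.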

I agree that it's not suitable for showing the filename to a user.

> I still think that PUA characters would be a better use

What if the filename contains the correct UTF-8 encoding of such a PUA
character?
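
To make the collision concrete, here is a small Python sketch of a
PUA-based escape. The mapping (invalid byte b becomes U+E000 + b) and
the name decode_pua are assumptions chosen only for illustration; no
particular PUA scheme is specified in this thread.

```python
PUA_BASE = 0xE000  # hypothetical start of the escape range


def decode_pua(data: bytes) -> str:
    """Decode bytes, mapping each invalid byte b to U+E000 + b."""
    out = []
    i = 0
    while i < len(data):
        # Try to read one well-formed UTF-8 sequence (1 to 4 bytes).
        for length in (1, 2, 3, 4):
            try:
                ch = data[i:i + length].decode("utf-8")
            except UnicodeDecodeError:
                continue
            out.append(ch)
            i += length
            break
        else:
            out.append(chr(PUA_BASE + data[i]))
            i += 1
    return "".join(out)


stray_byte = b"\xff"                    # not valid UTF-8
real_pua = "\ue0ff".encode("utf-8")     # valid UTF-8 for U+E0FF

# Two different filenames decode to the same string, so the original
# bytes cannot be recovered from the decoded name alone.
assert stray_byte != real_pua
assert decode_pua(stray_byte) == decode_pua(real_pua)
```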
--
__("< Marcin Kowalczyk
\__/ qrczak at knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/