[Python-3000] New proposition for Python3 bytes filename issue

Tue Sep 30 14:20:28 CEST 2008

On Tue, Sep 30, 2008 at 3:28 AM, Antoine Pitrou <solipsis at pitrou.net> wrote:
> Adam Olsen <rhamph <at> gmail.com> writes:
>>
>> The only way to display that file would be to transform it into some
>> other valid unicode string.  However, as that string is already valid,
>> you've just made any files named after it impossible to open.
>
> Not if those valid sequences are also properly escaped to avoid collisions.
> That's what utf-8b claims to do.
>
> My view of utf-8b is that it is not really a new codec, but an escaping phase
> added in front of utf-8, such that illegal byte sequences get converted to legal
> byte sequences. This is how e.g. XML-escaping works ("&" -> "&amp;", etc.). The
> only difficulty being in choosing sufficiently rare escaping sequences, so that
> readability is not impacted.

UTF-8b maps each invalid byte to a lone surrogate, and lone surrogates
are themselves malformed, so the result is still not a valid unicode
string.
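
A minimal sketch of that problem, assuming a Python 3 that has the
`surrogateescape` error handler (PEP 383), which follows the UTF-8b
idea.  The file name and the helper `is_strict_utf8` are made up for
illustration:

```python
# A Latin-1 file name: the byte 0xE9 is not valid UTF-8 on its own.
raw = b"caf\xe9.txt"

# UTF-8b style decoding: each undecodable byte 0xNN becomes the lone
# low surrogate U+DCNN.
name = raw.decode("utf-8", "surrogateescape")


def is_strict_utf8(text):
    """Return True if text can be encoded by a strict UTF-8 encoder."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# The decoded name round-trips back to the original bytes, but only
# through the same special handler; a strict encoder rejects it.
round_trip = name.encode("utf-8", "surrogateescape")
strict_ok = is_strict_utf8(name)
```

Here `round_trip` equals `raw`, while `strict_ok` is false: the name
survives only as long as every layer knows about the escaping.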

You bring up a good point though.  That sort of escaping is lossless,
and a PUA (Private Use Area) escape character would be unlikely to
collide.  It would still fail if another API were used to open the
file (gtk or openoffice?), and the thought of it creeping into other
apps gives me an icky feeling.
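
A rough sketch of such an escaping layer, under stated assumptions:
the escape character U+E000, the two-hex-digit encoding of bad bytes,
and the names `decode_filename` / `encode_filename` are all
hypothetical choices, not anything an existing codec defines.  It
leans on the `surrogateescape` handler only as a convenient way to
find the invalid bytes:

```python
ESC = "\ue000"  # hypothetical escape character from the Private Use Area


def decode_filename(raw: bytes) -> str:
    """Turn arbitrary bytes into a valid unicode string, losslessly."""
    # surrogateescape marks each undecodable byte 0xNN as U+DCNN.
    text = raw.decode("utf-8", "surrogateescape")
    out = []
    for ch in text:
        cp = ord(ch)
        if ch == ESC:
            # A genuine ESC in the name is doubled to avoid collisions.
            out.append(ESC + ESC)
        elif 0xDC80 <= cp <= 0xDCFF:
            # An invalid byte becomes ESC plus two hex digits.
            out.append(ESC + format(cp - 0xDC00, "02X"))
        else:
            out.append(ch)
    return "".join(out)


def encode_filename(name: str) -> bytes:
    """Reverse decode_filename and recover the original bytes."""
    out = bytearray()
    i = 0
    while i < len(name):
        ch = name[i]
        if ch != ESC:
            out += ch.encode("utf-8")
            i += 1
        elif name[i + 1 : i + 2] == ESC:
            # Doubled ESC stands for one literal ESC.
            out += ESC.encode("utf-8")
            i += 2
        else:
            # ESC followed by two hex digits stands for one raw byte.
            out.append(int(name[i + 1 : i + 3], 16))
            i += 3
    return bytes(out)
```

For example, `decode_filename(b"caf\xe9.txt")` gives `"caf\ue000E9.txt"`,
which a strict UTF-8 encoder accepts, and `encode_filename` turns it
back into the original bytes.  The escaped string is not the real
name, though, so handing it to any API that does not know the scheme
opens the wrong file or none at all.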

-- 
Adam Olsen, aka Rhamphoryncus