[Python-Dev] PEP 383 update: utf8b is now the error handler

Wed May 6 22:17:05 CEST 2009

On approximately 5/6/2009 12:18 PM, came the following characters from 
the keyboard of Zooko Wilcox-O'Hearn:
> On May 6, 2009, at 10:54 AM, Antoine Pitrou wrote:
> 
>> Zooko Wilcox-O'Hearn <zooko <at> zooko.com> writes:
>>>
>>> I'm not thinking of API compatibility as much as data compatibility 
>>> -- someone used Python 3.1 to write down some filenames, and now a 
>>> few years later they are trying to use the latest and greatest Python 
>>> release to read those filenames...
>>
>> Well, if the filenames are generated by Python (as opposed to read 
>> from an existing directory on disk), they should be regular unicode 
>> objects without any lone surrogates, so I don't see the compatibility 
>> problem.
> 
> I meant that the application reads filenames from an existing directory 
> on disk, saves those filenames, and then later, using a future version 
> of Python, wants to read them and use them.

Regarding future versions of Python: in the worst case, even if 
Python's default behavior changes, the transcoding done by PEP 383 can 
be done in other software too.  It is a straightforward, fully 
specified, 1-to-1, reversible transcoding process, affecting and 
generating only invalid byte encodings on one side, and invalid Unicode 
sequences on the other.

So if Python's default behavior should change, the transcoding 
implemented by PEP 383 could be easily reimplemented to enable a future 
version of a Python application to manipulate the transcoded, saved, 
filenames.

By easily, I mean that I could code it in a couple hours, max.
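
To make that estimate concrete, here is a minimal sketch of such a 
reimplementation, written as a custom codec error handler.  The handler 
name pep383_sketch is made up for this illustration.  It assumes the 
mapping described in PEP 383: each undecodable byte 0x80..0xFF becomes 
the lone surrogate U+DC80..U+DCFF, and back again.  Released Python 3 
ships this behavior under the handler name "surrogateescape", which the 
sketch is only compared against here.

```python
import codecs

def pep383_sketch(exc):
    """Hypothetical re-creation of the PEP 383 error handler (illustration only)."""
    if isinstance(exc, UnicodeDecodeError):
        byte = exc.object[exc.start]
        if byte < 0x80:
            raise exc  # ASCII bytes are never escaped
        # Escape one undecodable byte and resume decoding right after it.
        return chr(0xDC00 + byte), exc.start + 1
    if isinstance(exc, UnicodeEncodeError):
        out = bytearray()
        for ch in exc.object[exc.start:exc.end]:
            code = ord(ch)
            if not 0xDC80 <= code <= 0xDCFF:
                raise exc  # not something the decoding side could have produced
            out.append(code - 0xDC00)
        return bytes(out), exc.end
    raise exc

codecs.register_error("pep383_sketch", pep383_sketch)

raw = b"caf\xe9.txt"  # Latin-1 bytes, invalid as UTF-8
name = raw.decode("utf-8", "pep383_sketch")
assert name == raw.decode("utf-8", "surrogateescape")
assert name.encode("utf-8", "pep383_sketch") == raw
```

The same two branches could be ported to any language that lets you 
walk bytes and code points, which is all the saved filenames would need.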

> I'm not saying that I know this would be a problem.  I'm saying that I 
> personally can't tell whether it would be a problem or not, and the 
> extensive discussions so far have not convinced me that there is anyone 
> who both understands PEP 383 and considers this use case.

Does the above help?

> Many people who apparently understand encoding issues well have said 
> something to the effect that there is no problem, but those people 
> haven't yet managed to get through my thick skull how I would use PEP 
> 383 safely for this sort of use case -- the one where data generated by 
> os.listdir() travels forward in time or the one where that data travels 
> sideways to other systems, including Windows or other systems that 
> validate incoming unicode.

Regarding data traveling sideways, some comments:

1) PEP 383's effect could be recoded in other languages as easily as it 
is in Python (or the C in which Python is implemented).  So that could be 
a solution.

2) You mention "Windows" and "other systems that validate incoming 
unicode" in the same phrase, as if you think that Windows qualifies as 
one of the "other systems that validate incoming unicode", but it does 
not (at least not universally).

> That's why I am a bit uncomfortable about PEP 383 being quickly 
> implemented and deployed in Python 3.1.

Does the above help?

> By the way, much of the detailed discussion about what Tahoe requires 
> and how that may or may not benefit from PEP 383 has now moved to the 
> tahoe-dev mailing list: 
> http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev .

I have no background with Tahoe, nor particular interest, although it 
sounds like a useful project... so I won't be joining that list.  I have 
no idea if there is an installed base of existing Tahoe file systems; my 
suggestions below assume that there is not, and that you are presently 
inventing them.  Therefore, I provide no migration path; I could invent 
one, but it would take longer to describe.

However, since I'm responding here, and have read what you have posted 
here, it seems like the following could be true.

Assumptions from your emails:

A) Tahoe wants to provide a UTF-8 file name system
B) Tahoe wants to interface to POSIX systems that use (and do not 
validate) byte interfaces.
C) Tahoe wants to interface to non-POSIX systems that use 16-bit file 
name interfaces, with no validation.
D) Tahoe wants to interface to non-POSIX systems that use 16-bit file 
name interfaces, with validation.

Uncertainties: I'm not clear on what your goals are for Tahoe filenames. 
There seem to be 2 possibilities:

1) you want to reject attempts to use non-validating Unicode, be it from 
a 16-bit interface, or a bytes interface.
2) you don't want to reject non-validating Unicode, but you want to 
convert it to valid Unicode for (D) systems.

3) Orthogonally, you might want to store only valid Unicode in the 
names, or you might not care, as long as you can meet the other goals.

Truisms:

If you want to support (D), and (2), then you must transform names at 
some point, using some scheme, because not all names supplied by (B) 
systems will be acceptable to (D) systems.  You can choose to do this 
transformation when a (B) system provides an invalid (per Unicode) name, 
or you can choose to do the transformation when a (D) system accesses a 
file with an invalid (per Unicode) name.

If the (B) and (D) systems talk to each other outside of Tahoe, they 
will have to do similar transformations, or, if they both access the 
same Tahoe system, they will have to do the identical transformation, to 
be sure that they can access the same file.

All transcoding schemes have the possibility of data puns between 
non-transcoded names and transcoded names.  In order to successfully and 
properly manipulate a name, you must know whether or not it has been 
transcoded, and how.
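
A small sketch of such a pun.  The function to_valid_unicode below is a 
hypothetical scheme invented for this example (it is not something 
Tahoe or PEP 383 specifies): it spells each lone surrogate as a literal 
backslash escape so that a validating system would accept the result.

```python
def to_valid_unicode(name: str) -> str:
    # Hypothetical scheme: spell each lone surrogate as a literal \udcXX escape.
    return name.encode("utf-8", "backslashreplace").decode("utf-8")

escaped = b"caf\xe9".decode("utf-8", "surrogateescape")  # holds lone U+DCE9
literal = "caf\\udce9"                                   # valid name containing a backslash

# Two different original names end up with the same transformed spelling.
assert escaped != literal
assert to_valid_unicode(escaped) == to_valid_unicode(literal)
```

Once both names are stored in the transformed form, nothing in the 
string itself says which one was transcoded, so that fact has to be 
recorded somewhere else.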

PEP 383 limits its transcoding to names that are invalid (per Unicode). 
Names that cannot be properly decoded to Unicode are decoded to 
invalid Unicode.  Names that are invalid Unicode are encoded to invalid 
byte sequences (per the encoding scheme specified).
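
A short illustration of those limits, assuming UTF-8 as the encoding 
and using the handler name under which released Python 3 provides the 
PEP 383 behavior, "surrogateescape".  The sample names are made up.

```python
valid = "caf\u00e9.txt".encode("utf-8")  # well-formed UTF-8
invalid = b"caf\xe9.txt"                 # Latin-1 bytes, not valid UTF-8

# Valid bytes decode identically with or without the handler.
assert valid.decode("utf-8", "surrogateescape") == valid.decode("utf-8")

# Invalid bytes become invalid Unicode: a str holding a lone surrogate.
name = invalid.decode("utf-8", "surrogateescape")
assert name == "caf\udce9.txt"

# That str cannot be encoded strictly ...
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
assert not strict_ok

# ... but encoding with the handler restores the original invalid bytes.
assert name.encode("utf-8", "surrogateescape") == invalid
```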

For PEP 383 and Python, transcoded names can be distinguished by 
checking for the existence of lone surrogates in the str form of the 
filename, or by attempting to do a strict decoding of the bytes form of 
the filename, depending on what you have (generally, the former).
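
Both checks can be sketched in a few lines.  The function names are 
invented for this example, and UTF-8 is only an assumed default for the 
bytes form.

```python
def has_escaped_bytes(name: str) -> bool:
    """True if the str form holds a lone surrogate in the U+DC80..U+DCFF range."""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in name)

def is_strictly_decodable(raw: bytes, encoding: str = "utf-8") -> bool:
    """True if the bytes form decodes without needing any error handler."""
    try:
        raw.decode(encoding)  # the default error mode is "strict"
    except UnicodeDecodeError:
        return False
    return True

name = b"caf\xe9.txt".decode("utf-8", "surrogateescape")
assert has_escaped_bytes(name)
assert not has_escaped_bytes("plain.txt")
assert not is_strictly_decodable(b"caf\xe9.txt")
assert is_strictly_decodable(b"plain.txt")
```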

For PEP 383 and Python, the names will round trip from the POSIX bytes 
interfaces to the program, and back to POSIX bytes interfaces, as long 
as only Python wrappers of system functions are used, and the filesystem 
encoding is not changed between calls (or is restored).  Passing them to 
3rd party libraries or other systems requires extra work, if there is a 
desire to manipulate files with names that are not decodable to Unicode 
by the standard decoding algorithm for that encoding.
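
As a sketch of both halves of that paragraph, assuming a POSIX system 
and a Python release that provides os.fsdecode() and os.fsencode() 
(these arrived after 3.1): the first function checks the round trip 
through Python's own wrappers, and the second shows one possible piece 
of "extra work" for a consumer that rejects lone surrogates.  Both 
function names are made up, and the second one is deliberately lossy, 
so it is suitable for display only, not for reopening the file.

```python
import os

def roundtrips_via_os(raw: bytes) -> bool:
    """True if a bytes name survives os.fsdecode() followed by os.fsencode()."""
    return os.fsencode(os.fsdecode(raw)) == raw

def for_strict_consumer(name: str) -> str:
    """Lossy, display-only form: each lone surrogate is replaced by '?'."""
    return name.encode("utf-8", "replace").decode("utf-8")

assert roundtrips_via_os(b"plain.txt")
assert for_strict_consumer("caf\udce9.txt") == "caf?.txt"
assert for_strict_consumer("plain.txt") == "plain.txt"
```

Keeping the original str (or bytes) alongside the display form is what 
lets the program still open the file afterwards.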

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking