[Python-Dev] Python-3.0, unicode, and os.environ

Ulrich Eckhardt eckhardt at satorlaser.com
Mon Dec 8 11:20:49 CET 2008


On Sunday 07 December 2008, Guido van Rossum wrote:
> My problem with raising exceptions *by default* when an undecodable
> name exists is that it may render an app completely useless in a
> situation where the developer is no longer around. This happened all
> the time with the 2.x Unicode API, where the developer hadn't
> anticipated a particular input potentially containing non-ASCII bytes,
> and the user fed the application non-ASCII text. Making os.listdir
> raise an exception when a directory contains a single undecodable file
> means that the entire directory can't be read, and most likely the
> entire app crashes at that point. Most likely the developer never
> anticipated this situation (since in most places it is either
> impossible or very unlikely) -- after all, if they had anticipated it
> they would have used the bytes API in the first place.

There is another way to handle this that signals errors noisily but does not
cause programs to fail suddenly. Taking os.listdir as an example, the problem
is that the OS returns a list of strings that cannot be reliably decoded, so
I propose simply not decoding them.

The idea: what if this function returned neither a byte string nor a Unicode
string but a dedicated environment string type (call it env_str)? os.listdir
would then fail only if it really could not read the directory. If a user
wants to display an element of the returned list, they get something akin to
what repr() returns, i.e. a recognisable string that can be written to a
logfile. That string also carries additional markup that makes it clear it is
not plain text and not suitable for display to the end user.

This type distinction is important because any developer will immediately
see that something unexpected is going on. They will invoke "type(lst[0])"
and see the unfamiliar type env_str, whose documentation points them to the
encoding issue and tells them that 'map(unicode, lst)' yields a list of real
text strings. They will also read that this operation might fail, which
forces an informed decision.
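
To make the idea concrete, here is a minimal sketch in Python 3 syntax. The
class name EnvStr, its attributes, the "<env_str ...>" markup and the UTF-8
default are all assumptions made for illustration; no such type exists in
Python. The explicit decode() call plays the role of 'map(unicode, lst)'.

```python
# Hypothetical sketch of the env_str idea; not an existing Python type.
class EnvStr:
    """Opaque wrapper around the raw bytes handed over by the OS."""

    def __init__(self, raw, encoding="utf-8"):
        self.raw = bytes(raw)      # kept untouched so it can go back to the OS
        self.encoding = encoding   # assumed default encoding

    def __str__(self):
        # Loggable, repr-like form whose markup shows it is not plain text.
        return "<env_str {!r}>".format(self.raw)

    __repr__ = __str__

    def decode(self):
        # Explicit conversion to text; raises UnicodeDecodeError on bad bytes.
        return self.raw.decode(self.encoding)


names = [EnvStr(b"report.txt"), EnvStr(b"caf\xe9.txt")]  # second is Latin-1
print(names[1])            # displaying never fails
print(names[0].decode())   # a decodable name converts cleanly
try:
    [n.decode() for n in names]
except UnicodeDecodeError as exc:
    print("conversion failed:", exc.reason)
```

Listing and logging always work here; only the explicit conversion to text
can raise, and the raw bytes stay available unchanged.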

If they don't care about a textual representation at all and only want to
invoke os.popen with arguments received from the commandline, everything is
fine too, because that function takes the strings as they are and hands them
back to the OS. This allows roundtripping from the OS through Python and back
to the OS without any conversions, and thus without any conversion that could
fail. For e.g. a backup program, this is exactly what is needed.

Now, if your program contains hard-coded strings and a function like
os.popen needs an env_str object, the string is converted via a default
encoding, i.e. the same one that is used when converting an env_str object to
Unicode. I would go so far as to say that os.popen should accept normal str
strings too and perform that conversion itself. The alternative would be to
reject the string because it is the wrong type, but since the encoding of
this internal string is known, there is no reason to force users to convert
explicitly. The only caveat is that the conversion might fail.

The same applies when modifying such an env_str object, e.g. "bak =
sys.argv[1]+'.backup'". Here the string '.backup' is converted according to
the default encoding and then appended to the commandline argument, and the
result is again an env_str object.
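
The coercion and concatenation rules from the last two paragraphs could be
sketched roughly as follows. EnvStr, coerce() and DEFAULT_ENCODING are
hypothetical names, and UTF-8 is only an assumed stand-in for the default
encoding. A function like os.popen would call coerce() on its arguments.

```python
# Hypothetical sketch; EnvStr and DEFAULT_ENCODING are illustrative names.
DEFAULT_ENCODING = "utf-8"  # assumption: stands in for the default encoding


class EnvStr:
    def __init__(self, raw):
        self.raw = bytes(raw)

    @classmethod
    def coerce(cls, value):
        # Accept an EnvStr as is; encode ordinary text with the default
        # encoding. The encode step may raise UnicodeEncodeError.
        if isinstance(value, cls):
            return value
        if isinstance(value, str):
            return cls(value.encode(DEFAULT_ENCODING))
        raise TypeError("expected str or EnvStr")

    def __add__(self, other):
        # env_str + text -> env_str
        return EnvStr(self.raw + EnvStr.coerce(other).raw)

    def __radd__(self, other):
        # text + env_str -> env_str
        return EnvStr(EnvStr.coerce(other).raw + self.raw)


arg = EnvStr(b"data\xff")   # stands in for sys.argv[1] with a stray byte
bak = arg + ".backup"       # the result is again an EnvStr
print(type(bak).__name__, bak.raw)
```

The undecodable byte passes through untouched; only the hard-coded text is
converted.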


Note: There is an option in this design, namely to make the default
behaviour for non-convertible env_str objects configurable. A file manager
would then replace the undecodable bytes with an approximation, a backup
program would use strict mode, and a music player would perhaps simply skip
such strings. The problem is that changing this option could affect other
library code that one doesn't even know about, because it is only used
indirectly and its implementation is unknown. For that reason, I would rather
not make this policy configurable. If you want that behaviour, you can easily
code it yourself.
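
As a rough illustration of coding the policy yourself, the three behaviours
could look like this. Plain bytes objects stand in for env_str values, and
the function names and the UTF-8 default are assumptions for this sketch.

```python
# Sketch only: plain bytes stand in for env_str values.
def approximate(names, encoding="utf-8"):
    # File-manager style: substitute U+FFFD for undecodable bytes.
    return [n.decode(encoding, errors="replace") for n in names]


def strict(names, encoding="utf-8"):
    # Backup-program style: let UnicodeDecodeError propagate to the caller.
    return [n.decode(encoding) for n in names]


def skip_undecodable(names, encoding="utf-8"):
    # Music-player style: silently drop names that do not decode.
    result = []
    for n in names:
        try:
            result.append(n.decode(encoding))
        except UnicodeDecodeError:
            continue
    return result


names = [b"song.ogg", b"caf\xe9.ogg"]   # second name is Latin-1, not UTF-8
print(approximate(names))
print(skip_undecodable(names))
```

Each application picks its policy at the call site, so no global setting can
change the behaviour of unrelated library code.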

BTW: there was a PEP that proposed a new path class, which was rejected. This 
class was actually pretty similar, except that it also included several other 
features (globbing, path handling, opening files and the kitchen sink) which 
eventually made it too bloated. Otherwise, the idea of creating a separate 
type for these strings is the same.


Uli

