[Python-Dev] Python-3.0, unicode, and os.environ
Ulrich Eckhardt
eckhardt at satorlaser.com
Mon Dec 8 11:20:49 CET 2008
On Sunday 07 December 2008, Guido van Rossum wrote:
> My problem with raising exceptions *by default* when an undecodable
> name exists is that it may render an app completely useless in a
> situation where the developer is no longer around. This happened all
> the time with the 2.x Unicode API, where the developer hadn't
> anticipated a particular input potentially containing non-ASCII bytes,
> and the user fed the application non-ASCII text. Making os.listdir
> raise an exception when a directory contains a single undecodable file
> means that the entire directory can't be read, and most likely the
> entire app crashes at that point. Most likely the developer never
> anticipated this situation (since in most places it is either
> impossible or very unlikely) -- after all, if they had anticipated it
> they would have used the bytes API in the first place.
There is another way to handle this that signals errors noisily but doesn't
cause programs to fail suddenly. Taking os.listdir as an example, the problem
is that the OS actually returns a list of strings that cannot be reliably
decoded, so I propose simply not decoding them.
The idea: what if this function returned neither a byte string nor a Unicode
string, but a dedicated environment string type (call it env_str)?
os.listdir would then fail only if it really could not read the directory. If
a user wants to display an element of the returned list, they get something
akin to what repr() returns, i.e. a recognisable string that can be written
to a logfile. This representation also carries additional markup that
makes it clear that it is not just a piece of text and is not suitable for
display to the end user.
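
A minimal sketch of how such a type might look, assuming Python 3 syntax. The
names EnvStr and wrap_names and the exact markup are illustrative assumptions,
not part of any existing API:

```python
class EnvStr:
    """Hypothetical stand-in for the proposed env_str type."""

    def __init__(self, raw: bytes):
        self.raw = raw  # bytes exactly as the OS returned them

    def __repr__(self) -> str:
        # The markup makes it obvious that this is not ordinary text.
        return "<env_str {!r}>".format(self.raw)


def wrap_names(raw_names):
    """Wrap undecoded names; no decoding happens, so none can fail."""
    return [EnvStr(name) for name in raw_names]


# Stand-in for what a listdir-like call might hand back.
names = wrap_names([b"report.txt", b"caf\xe9.txt"])
print(names)  # [<env_str b'report.txt'>, <env_str b'caf\xe9.txt'>]
```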
This type distinction is important, because it means that any developer will
immediately see that something unexpected is going on. They will
invoke "type(lst[0])" and see the unexpected type env_str, which will (via the
documentation) lead them to the issue of differing encodings. There they learn
that all they have to do is 'map(unicode, lst)' to get a list of real
text strings, but also that this operation might fail, which forces
an informed decision.
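
The explicit conversion step could be sketched as follows. This assumes
Python 3, where str plays the role of unicode, a hypothetical EnvStr class
whose __str__ decodes strictly, and UTF-8 as the default encoding:

```python
class EnvStr:
    """Hypothetical env_str stand-in with a strict text conversion."""

    def __init__(self, raw, encoding="utf-8"):
        self.raw = raw
        self.encoding = encoding  # assumed default encoding

    def __str__(self):
        # Strict decoding: raises UnicodeDecodeError on undecodable bytes.
        return self.raw.decode(self.encoding)


lst = [EnvStr(b"report.txt"), EnvStr(b"caf\xe9.txt")]
print(type(lst[0]).__name__)  # the unexpected type shows up immediately

# The developer has to decide what happens when conversion fails.
texts, failures = [], []
for item in lst:
    try:
        texts.append(str(item))
    except UnicodeDecodeError:
        failures.append(item.raw)
```

Here the second name is Latin-1 encoded, so it ends up in failures instead of
crashing the program or being silently mangled.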
If they don't care about a textual representation at all and only want to
invoke os.popen with arguments received from the commandline, then everything
is fine, too, because that function takes the strings as they are and
just hands them back to the OS. This allows roundtripping from the OS through
Python and back to the OS without any conversions, and thus without anything
that could fail. For a backup program, for example, this is exactly what
is needed.
Now, if you have hard-coded strings in your program but a function like
os.popen needs an env_str object, the string is converted via a default
encoding, i.e. the same one that is used when converting an env_str object to
Unicode. Here I would go so far as to say that os.popen should accept
normal str strings, too, and perform that conversion itself. An alternative
would be to reject the string because it is the wrong type, but since
the encoding of such an internal string is known, there is no reason to force
users to convert explicitly. The only caveat is that the conversion might fail.
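
As a rough sketch of that acceptance rule: the helper to_os_bytes below is
hypothetical and only models what an OS-facing function such as os.popen
would do with its argument before handing it to the system. EnvStr and the
UTF-8 default are assumptions as well:

```python
class EnvStr:
    """Hypothetical env_str stand-in holding the raw OS bytes."""

    def __init__(self, raw):
        self.raw = raw


def to_os_bytes(arg, encoding="utf-8"):
    """Return the bytes that would be handed to the OS for this argument."""
    if isinstance(arg, EnvStr):
        return arg.raw  # passed through untouched, nothing can fail
    if isinstance(arg, str):
        # Implicit conversion of ordinary text; may raise UnicodeEncodeError.
        return arg.encode(encoding)
    raise TypeError("expected EnvStr or str, got " + type(arg).__name__)


roundtrip = to_os_bytes(EnvStr(b"caf\xe9.txt"))  # bytes are unchanged
literal = to_os_bytes("ls -l")                   # text is encoded for the OS
```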
The same applies when modifying such an env_str object, e.g. "bak =
sys.argv[1]+'.backup'". Here the string '.backup' is converted
according to the default encoding and then appended to the commandline
argument; the result is again an env_str object.
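
The concatenation rule might be sketched like this. EnvStr is again a
hypothetical stand-in, and the UTF-8 default encoding is an assumption:

```python
class EnvStr:
    """Hypothetical env_str stand-in that supports appending text."""

    def __init__(self, raw, encoding="utf-8"):
        self.raw = raw
        self.encoding = encoding  # assumed default encoding

    def __add__(self, other):
        if isinstance(other, EnvStr):
            tail = other.raw
        elif isinstance(other, str):
            # Ordinary text is converted via the default encoding.
            tail = other.encode(self.encoding)
        else:
            return NotImplemented
        # The result stays an EnvStr, so the original bytes are preserved.
        return EnvStr(self.raw + tail, self.encoding)


arg = EnvStr(b"caf\xe9.txt")  # stands in for sys.argv[1]
bak = arg + ".backup"
```

Note that the undecodable byte in arg never has to be decoded for this to
work; only the appended literal is converted.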
Note: This design leaves one option open, namely making the default
behaviour for non-convertible env_str objects configurable. A
filemanager would then replace the undecodable bytes with an approximation, a
backup program would use strict mode, and a music player would perhaps simply
skip and ignore such strings. The problem is that changing this option
could affect other library code that one doesn't even know about
because it is only used indirectly and its implementation is unknown. For
that reason, I would rather not make this policy a configurable element. If
you want that, you can easily code it yourself.
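
Coding such a policy yourself could look roughly like this. The function
to_text and the policy names are illustrative assumptions, and the sketch
works on plain bytes to stay independent of any particular env_str design:

```python
def to_text(raw_names, policy, encoding="utf-8"):
    """Decode names, handling undecodable ones per an app-chosen policy."""
    if policy not in ("strict", "replace", "skip"):
        raise ValueError("unknown policy: " + policy)
    result = []
    for raw in raw_names:
        try:
            result.append(raw.decode(encoding))
        except UnicodeDecodeError:
            if policy == "strict":
                raise  # e.g. a backup program
            if policy == "replace":
                # e.g. a filemanager showing an approximation
                result.append(raw.decode(encoding, errors="replace"))
            # policy == "skip": e.g. a music player ignoring the entry
    return result


raw_names = [b"report.txt", b"caf\xe9.txt"]
shown = to_text(raw_names, "replace")
kept = to_text(raw_names, "skip")
```

Because the policy is an argument of the application's own helper, it cannot
leak into library code that is used only indirectly.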
BTW: there was a PEP that proposed a new path class, which was rejected. This
class was actually pretty similar, except that it also included several other
features (globbing, path handling, opening files and the kitchen sink) which
eventually made it too bloated. Otherwise, the idea of creating a separate
type for these strings is the same.
Uli
--
Sator Laser GmbH
Managing Director: Thorsten Föcking, Amtsgericht Hamburg HR B62 932