[Python-Dev] Python-3.0, unicode, and os.environ

Fri Dec 12 10:10:13 CET 2008

On Friday 12 December 2008, Stephen J. Turnbull wrote:
> I gather that the BFDL's line on this thread of discussion is that
> forcing programmers to think about encodings every time they call out
> to the OS is unacceptable

That is exactly what is not necessary:

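  import os

  for n in os.listdir('.'):
      if not os.path.isfile(n):
          continue  # skip directories and other non-files
      with open(n, 'rb') as f:
          if b'foo' in f.read():  # stands in for grep('foo', f)
              print('found "foo"!')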

Now, if you actually wanted to output the filename, you could never do so 
reliably anyway, because even though it is supposed to be text, the encoding 
isn't known. So, an archiving program will probably do something like this:

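   # 'archive' is some archive writer object, not defined here
   try:
       for n in os.listdir('.'):
           b = n.encode('UTF-8')  # raises if the name is not valid text
           with open(n, 'rb') as f:
               archive.write_file_header(b)
               archive.write_file(f)
   except UnicodeError:
       print("oops, couldn't decode file '%s'"
             % n.encode('UTF-8', 'replace').decode('UTF-8'))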

If you're writing a file manager, you would store the original path alongside 
an approximate Unicode representation that is used for display only.
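
A minimal sketch of that idea, assuming UTF-8 as the display encoding (the 
real encoding is unknown, which is the whole point). The names display_name 
and list_entries are made up for illustration:

```python
import os

def display_name(raw):
    # Approximation for display only: undecodable bytes become U+FFFD.
    # UTF-8 is an assumption here; the real encoding is unknown.
    return raw.decode('utf-8', errors='replace')

def list_entries(directory=b'.'):
    # A bytes argument makes os.listdir() return bytes names, which can be
    # passed back to open() unchanged, whatever their encoding.
    return [(os.path.join(directory, raw), display_name(raw))
            for raw in os.listdir(directory)]
```

The file manager would show the second element of each pair and use the 
first one for every call back into the OS.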

> when most programs will work acceptably 
> almost all of the time with a rather naive approach.  This means that
> almost all Python programs will be technically broken for the
> foreseeable future, sorry, Ulrich.

Actually, they are already broken; it's just that few people notice. :|

> And for the same pragmatic reasons, these functions are going to
> return strings (ie, Unicode), not bytes, I expect.  Sorry, Steve.
>
> What needs to be determined here is the best way to provide
> reliability to those who will go to the effort of asking for it if
> it's available.  I don't think "just return bytes" fits the bill for
> the reason above.
>
> What I would like to see is a type that is derived from string (so if
> you present it to an API expecting string, it is silently treated as
> string), but from which the original bytes can always be extracted on
> request.

I like that idea. This type would behave pretty much like the env_string I 
proposed. The main difference is that it does several implicit conversions 
where I personally would rather see explicit ones. Other than that, I'm all 
for it.
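
Such a type could look roughly like the following. This is only a sketch 
under assumptions: the class name OSString, the attributes raw and decodable, 
and the choice of UTF-8 with U+FFFD replacement are all hypothetical, and the 
fallback policy is exactly the part that is still under discussion.

```python
class OSString(str):
    """Hypothetical str subclass that remembers the bytes it came from."""

    def __new__(cls, raw, encoding='utf-8'):
        try:
            text = raw.decode(encoding)
            decodable = True
        except UnicodeDecodeError:
            # Made-up approximation; which fallback to use is an open question.
            text = raw.decode(encoding, errors='replace')
            decodable = False
        self = super().__new__(cls, text)
        self.raw = raw              # the original bytes, always available
        self.decodable = decodable  # False if text is only an approximation
        return self
```

An API expecting a string sees an ordinary str, while careful code can check 
decodable and fall back to raw. Note that str methods such as upper() return 
a plain str, so the original bytes are lost as soon as the value is 
manipulated.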

> If the original bytes cannot be sensibly decoded to a 
> string, then the string field in the object would either contain
> something that should normally cause an error in a string API, or some
> made-up string (presumably it would attempt to be a more or less
> faithful representation of the bytes) at the caller's option.
> Probably they'd also contain some metadata useful in guessing
> encodings (the read time locale in particular).

Well, I wouldn't provide an approximation. Considering the archiving software 
above, you would end up with a file name "<undecodable file name>" in an 
archive. For that kind of software, it would be fatal. But, and that is much 
more important than my preference, at least your approach would allow writing 
reliable software that properly handles such environment strings. Further, 
and that is where it differs from just returning bytes, it even makes that 
easy by using a distinct type.

Uli

-- 
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932
