[Python-Dev] Python-3.0, unicode, and os.environ
Ulrich Eckhardt
eckhardt at satorlaser.com
Fri Dec 12 10:10:13 CET 2008
On Friday 12 December 2008, Stephen J. Turnbull wrote:
> I gather that the BFDL's line on this thread of discussion is that
> forcing programmers to think about encodings every time they call out
> to the OS is unacceptable
That is exactly what is not necessary. Code that only hands a name back to
the OS never needs to know its encoding:
    for n in os.listdir('.'):
        f = open(n)
        # grep() stands for any search over the file's content
        if grep('foo', f):
            print('found "foo"!')
Now, if you actually wanted to output the filename, you could never do so
reliably anyway, because even though it is supposed to be text, the encoding
isn't known. So, an archiving program will probably do something like this:
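    try:
        for n in os.listdir('.'):
            b = n.encode('UTF-8')  # fails if the name is not valid text
            f = open(n)
            archive.write_file_header(b)
            archive.write_file(f)
    except UnicodeError:
        # n.unicode() belongs to the proposed filename type, not to str
        print("oops, couldn't decode file '%s'" % n.unicode(error='replace'))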
If you're writing a filemanager, you would store the path alongside an
approximated Unicode representation.
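A minimal sketch of such a record, assuming the names arrive as bytes (for example from os.listdir() called with a bytes path). The class FileEntry and its attribute names are hypothetical and only illustrate the idea of keeping the exact path next to a display approximation.

```python
class FileEntry:
    """Hypothetical file-manager record: exact name plus a display-only text."""

    def __init__(self, raw_name, encoding='utf-8'):
        self.raw_name = raw_name  # exact bytes; always use these to reach the file
        try:
            self.display_name = raw_name.decode(encoding)
            self.is_exact = True
        except UnicodeDecodeError:
            # Undecodable bytes become U+FFFD, so this text is for display only.
            self.display_name = raw_name.decode(encoding, errors='replace')
            self.is_exact = False
```

A file manager would show display_name in its listing and pass raw_name to every OS call, so that a lossy approximation never reaches the file system.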
> when most programs will work acceptably
> almost all of the time with a rather naive approach. This means that
> almost all Python programs will be technically broken for the
> foreseeable future, sorry, Ulrich.
Actually, they are already broken; it's just that few people notice. :|
> And for the same pragmatic reasons, these functions are going to
> return strings (ie, Unicode), not bytes, I expect. Sorry, Steve.
>
> What needs to be determined here is the best way to provide
> reliability to those who will go to the effort of asking for it if
> it's available. I don't think "just return bytes" fits the bill for
> the reason above.
>
> What I would like to see is a type that is derived from string (so if
> you present it to an API expecting string, it is silently treated as
> string), but from which the original bytes can always be extracted on
> request.
I like that idea; this type would behave pretty much like the env_string I
proposed. The main difference is that it performs several implicit
conversions where I personally would rather see explicit ones. Other than
that, I'm all for it.
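To make the quoted idea concrete, here is a rough sketch of a str subclass that keeps the bytes it was decoded from. The class name EnvString, its raw attribute and the default encoding are assumptions made for illustration. It is neither the env_string proposal nor anything in the standard library.

```python
class EnvString(str):
    """Hypothetical str subclass that remembers the bytes it was decoded from."""

    def __new__(cls, raw, encoding='utf-8'):
        # The text value is only an approximation when raw is not valid
        # in the given encoding (undecodable bytes become U+FFFD).
        self = super().__new__(cls, raw.decode(encoding, errors='replace'))
        self.raw = raw  # original bytes, available on request
        return self
```

An instance passes silently wherever a str is expected, which is the implicit conversion discussed above. Note that str methods such as upper() or slicing return a plain str, so the original bytes are lost as soon as the value is transformed.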
> If the original bytes cannot be sensibly decoded to a
> string, then the string field in the object would either contain
> something that should normally cause an error in a string API, or some
> made-up string (presumably it would attempt to be a more or less
> faithful representation of the bytes) at the caller's option.
> Probably they'd also contain some metadata useful in guessing
> encodings (the read time locale in particular).
Well, I wouldn't provide an approximation. Considering the archiving software
above, you would end up with a file name "<undecodable file name>" in an
archive. For that kind of software, it would be fatal. But, and that is much
more important than my preference, at least your approach would allow writing
reliable software that properly handles such environment strings. Further,
and that is where it differs from just returning bytes, it even makes it easy
by using a distinct type.
Uli
--
Sator Laser GmbH
Managing Director: Thorsten Föcking, Amtsgericht Hamburg HR B62 932