the official way of printing unicode strings

J. Clifford Dyer jcd at sdf.lonestar.org
Sun Dec 14 16:55:07 EST 2008


On Sun, 2008-12-14 at 11:16 +0100, Piotr Sobolewski wrote:
> Marc 'BlackJack' Rintsch wrote:
> 
> > I'd make that first line:
> > sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
> > 
> > Why is it even more cumbersome to execute that line *once* instead
> > encoding at every ``print`` statement?
> 
> Oh, maybe it's not cumbersome, but a little bit strange - but sure, I can
> get used to it. 
> 
> My main problem is that when I use some language I want to use it the way it
> is supposed to be used. Usually doing like that saves many problems.
> Especially in Python, where there is one official way to do any elementary
> task. And I just want to know what is the normal, official way of printing
> unicode strings. I mean, the question is not "how can I print the unicode
> string" but "how the creators of the language suppose me to print the
> unicode string". I couldn't find an answer to this question in docs, so I
> hope somebody here knows it.
> 
> So, is it _the_ python way of printing unicode?
> 

The "right way" to print a unicode string is to encode it in the
encoding that is appropriate for your needs (which may or may not be
UTF-8), and then to print it.  What this means in terms of your three
examples is that the first and third are correct, and the second is
incorrect.  The second one breaks when writing to a file, so don't use
it.  Both the first and third behave in the way that I suggest.  The
first (print u'foo'.encode('utf-8')) is less cumbersome if you do it
once, but the third method (rebinding sys.stdout using
codecs.getwriter) is less cumbersome if you'll be doing a lot of
printing on stdout.

In the end, they are the same method, but one of them introduces another
layer of abstraction.  If you'll be using more than two print statements
that need a non-ASCII encoding, I'd recommend the third, as it becomes
less cumbersome the more you print.
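
To make the "same method" point concrete, here is a minimal sketch.  It
assumes an in-memory io.BytesIO buffer as a stand-in for sys.stdout (so
the bytes can be inspected), and the sample string is only an
illustration.  Encoding by hand and writing through a codecs.getwriter
wrapper should put identical bytes on the stream:

```python
import codecs
import io

text = u"Za\u017c\u00f3\u0142\u0107"  # sample non-ASCII text (an assumption)

# First way: encode explicitly, then write the resulting bytes.
manual = io.BytesIO()
manual.write(text.encode('utf-8'))

# Third way: wrap the stream once; the writer encodes on every write.
wrapped = io.BytesIO()
writer = codecs.getwriter('utf-8')(wrapped)
writer.write(text)

# Both buffers should now hold the same UTF-8 bytes.
same_bytes = manual.getvalue() == wrapped.getvalue()
print(same_bytes)
```

The wrapper only moves the encode() call out of sight; nothing else
about the output changes.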

That said, you should also consider whether you want to rebind
sys.stdout or not.  It makes your print statements less verbose, but it
also loses your reference to the basic stdout.  What if you want to
print using UTF-8 for a while, but then you need to switch to another
encoding later?  If you've used a new name, you can still refer back to
the original sys.stdout.

Right:

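import codecs
import sys
my_out = codecs.getwriter('utf-8')(sys.stdout)
print >> my_out, u"Stuff"
# 'cp037' is one of Python's EBCDIC codecs; there is no codec named 'ebcdic'
my_out = codecs.getwriter('cp037')(sys.stdout)
print >> my_out, u"Stuff"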

Wrong:

import codecs
import sys
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
print u"Stuff"
sys.stdout = codecs.getwriter('cp037')(sys.stdout)
# Now sys.stdout is getting encoded twice, and you'll probably
# get garbage out (or a UnicodeDecodeError). :(
print u"Stuff"
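
A minimal sketch of why stacking writers goes wrong, assuming an
io.BytesIO buffer in place of the real stdout and cp037 as the EBCDIC
codec.  The outer writer hands already-encoded bytes to the inner
writer, which then tries to encode them a second time.  In this example
that should surface as an exception (a TypeError or a UnicodeError,
depending on the Python version) rather than as output:

```python
import codecs
import io

raw = io.BytesIO()  # stand-in for the original stdout (an assumption)

inner = codecs.getwriter('utf-8')(raw)    # first rebinding
outer = codecs.getwriter('cp037')(inner)  # second rebinding wraps the first

try:
    outer.write(u"Stuff")
    failed = False
except (TypeError, UnicodeError):
    # The inner writer was handed EBCDIC bytes, not text.
    failed = True

print(failed)
```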

Though I guess this is why the OP is doing:

sys.stdout = codecs.getwriter('utf-8')(sys.__stdout__)

That avoids the problem by always wrapping the original file object
rather than whatever sys.stdout currently points to.  sys.__stdout__ is
still in its original state.
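
As a rough sketch of that pattern, with an io.BytesIO buffer playing the
role of sys.__stdout__ (an assumption, so the result can be inspected)
and cp037 standing in for EBCDIC: each new writer wraps the original
stream, so switching encodings never stacks one encoder on another.

```python
import codecs
import io

original = io.BytesIO()  # plays the role of sys.__stdout__ (an assumption)

out = codecs.getwriter('utf-8')(original)
out.write(u"\u00e9\n")  # written as UTF-8

# Wrap the original stream again, not the previous writer.
out = codecs.getwriter('cp037')(original)
out.write(u"Stuff\n")  # written as EBCDIC (cp037)

data = original.getvalue()
print(repr(data))
```

Each chunk in the buffer is encoded exactly once, in whichever encoding
was active when it was written.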

Carry on, then.

Cheers,
Cliff



