print() and unicode strings (python 3.1)

7stud bbxx789_05ss at yahoo.com
Tue Aug 25 06:41:54 EDT 2009


On Aug 24, 10:09 pm, Ned Deily <n... at acm.org> wrote:
> In article
> <e5e2ec2e-2b4a-4ca8-8c0f-109e5f4eb... at v23g2000pro.googlegroups.com>,
>
>
>
>  7stud <bbxx789_0... at yahoo.com> wrote:
> > On Aug 24, 2:41 pm, "Martin v. Löwis" <mar... at v.loewis.de> wrote:
> > > > I can't figure out a way to programatically set the encoding for
> > > > sys.stdout.  So where does that leave me?
>
> > > You should be setting the terminal encoding administratively, not
> > > programmatically.
>
> > The terminal encoding has always been utf-8.  It was not set
> > programmatically.
>
> > It seems to me that python 3.1's string handling is broken.
> > Apparently, in python 3.1 I am unable to explicitly set the encoding
> > of a string and print() it out with the result being human readable
> > text.  On the other hand, if I let python do the encoding implicitly,
> > python uses a codec I don't want it to.
>
> If you are running on a Unix-y system, check your locale settings (LANG,
> LC.*, et al).  I think you'll likely find that your locale is really not
> UTF-8.   The following was on Python 3.1 on OS X 10.5, similar results
> on Debian Linux:
>
> $ cat t3.py
> import sys
> print(sys.stdout.encoding)
> s = "€"
> print(s.encode("utf-8"))
> print(s)
>
> $ export LANG=en_US.UTF-8
> $ python3.1 t3.py
> UTF-8
> b'\xe2\x82\xac'
> €
>
> $ export LANG=C
> $ python3.1 t3.py
> US-ASCII
> b'\xe2\x82\xac'
> Traceback (most recent call last):
>   File "t3.py", line 7, in <module>
>     print(s)
> UnicodeEncodeError: 'ascii' codec can't encode character '\u20ac' in
> position 0: ordinal not in range(128)
>
> --
>  Ned Deily,
>  n... at acm.org

Hi,

Thanks for the response.  My OS is mac osx 10.4.11.  I'm not really
sure how to check my locale settings.  Here is some stuff I tried:

$ echo $LANG

$ echo $LC_ALL

$ echo $LC_CTYPE

$ locale
LANG=
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL="C"

$ man locale
...
...
...

ENVIRONMENT:
LANG
Used as a substitute for any unset LC_* variable.  If LANG is unset it
will act as if set to "C".  If any of LANG or LC_* are set to invalid
values locale acts as if they are all unset.

===========

As in your last example, my 'C' settings mean that an ascii codec is
used somewhere to encode() the unicode string.

--
The locale C or POSIX is a portable locale; its LC_CTYPE part
corresponds to the 7-bit ASCII character set.

http://linux.about.com/library/cmd/blcmdl3_setlocale.htm
--


Is this the way it works:


1) python chooses the codec for sys.stdout based on the LANG
environment variable.
2) It doesn't matter that my terminal's encoding is set to utf-8
because output has to pass through sys.stdout first.

So:

a) My terminal's environment is telling python (and all other programs
running in the terminal) that output sent to sys.stdout must be
encoded in ascii.
b) The solution is to set a LANG environment variable.
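
To make points 1) and 2) concrete, here is a minimal sketch that stands
in for sys.stdout with an in-memory text stream.  write_through is a
hypothetical helper, not something python provides, and the in-memory
stream is an assumption of the sketch.  It only illustrates that the
stream's codec, not the terminal, decides whether the euro sign can be
written.

```python
import io


def write_through(text, encoding):
    """Print text to an in-memory stream that uses the given codec.

    Returns the bytes that would have been handed to the terminal.
    Raises UnicodeEncodeError if the codec cannot represent the text.
    """
    raw = io.BytesIO()
    # newline="\n" keeps the output identical on every platform.
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    print(text, file=stream)
    stream.flush()
    return raw.getvalue()


euro = "\u20ac"  # the euro sign

# A utf-8 stream accepts the character.
print(write_through(euro, "utf-8"))

# An ascii stream rejects it, whatever the terminal could display.
try:
    write_through(euro, "ascii")
except UnicodeEncodeError as exc:
    print("ascii stream failed:", exc.reason)
```

The error is raised inside print(), before any bytes reach the
terminal, which matches the traceback in Ned's LANG=C run.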


Why does echoing $LC_ALL or $LC_CTYPE just give me a blank string?
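
A likely explanation, as far as I can tell: echo shows the variable
itself, which is simply unset, while locale reports the effective
value after defaults are applied.  The sketch below follows the usual
POSIX lookup order for one category (LC_ALL, then LC_CTYPE, then LANG,
then "C").  effective_ctype is a hypothetical name used only for
illustration, and the function looks at a plain dict rather than the
real environment.

```python
def effective_ctype(env):
    """Return the locale name that applies to LC_CTYPE.

    env is a mapping of environment variable names to values.
    Lookup order: LC_ALL, then LC_CTYPE, then LANG, then "C".
    Unset and empty variables are treated the same way.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"


# Nothing set at all: every echo is blank, yet the effective locale is "C".
print(effective_ctype({}))

# Setting only LANG is enough to change the effective LC_CTYPE.
print(effective_ctype({"LANG": "en_US.UTF-8"}))
```

To inspect the real environment, pass os.environ instead of a literal
dict.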


Previously, I've set environment variables that I want to be
permanent, e.g. PATH, in ~/.bash_profile, so I did this:

~/.bash_profile:
--------------
...
...
LANG="en_US.UTF-8"
export LANG

and now python 3.1 acts like I expect it to:

-------
import locale
import sys

print(locale.getlocale(locale.LC_CTYPE))
print(sys.stdout.encoding)


s = "€"
print(s)

print(s.encode("utf-8"))

--output:--
('en_US', 'UTF8')
UTF-8
€
b'\xe2\x82\xac'
----------

In conclusion, as far as I can tell, if your python 3.1 program tries
to output a unicode string, and the unicode string cannot be encoded
by the codec specified in the user's LANG environment variable**, then
the user will get an encode error.  Just because the programmer's
system can handle the output doesn't mean that another user's system
can.  I guess that's the way it goes: if a user's environment is
telling all programs that it only wants ascii output to go to the
screen (sys.stdout), you can't (or shouldn't) do anything about it.

**Or if the LANG environment variable is not present, then the codec
corresponding to the locale settings ('C' corresponds to ascii).
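
If an outright failure is not acceptable, one option is to escape the
characters the user's codec cannot represent rather than override the
user's settings.  This is only a sketch under stated assumptions:
safe_print is a hypothetical helper, and falling back to ascii when a
stream reports no encoding is my own conservative choice.

```python
import io
import sys


def safe_print(text, stream=None):
    """Print text, escaping characters the stream's codec cannot encode."""
    stream = sys.stdout if stream is None else stream
    # Assumption: a stream without a known encoding is treated as ascii.
    encoding = getattr(stream, "encoding", None) or "ascii"
    # Round-trip through the codec: unencodable characters come back
    # as backslash escapes such as \u20ac instead of raising an error.
    encoded = text.encode(encoding, errors="backslashreplace")
    print(encoded.decode(encoding), file=stream)


# Demonstration with an in-memory ascii stream standing in for sys.stdout.
raw = io.BytesIO()
ascii_out = io.TextIOWrapper(raw, encoding="ascii", newline="\n")
safe_print("price: \u20ac5", ascii_out)
ascii_out.flush()
print(raw.getvalue())
```

On a utf-8 stream the text passes through unchanged, so users whose
locale can show the euro sign still see it.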

some good locale info:
http://www.chemie.fu-berlin.de/chemnet/use/info/libc/libc_19.html


