Printing UTF-8

Thu Sep 21 18:47:02 EDT 2006

sheldon.regular at gmail.com wrote:
> I am new to unicode so please bear with my stupidity.
>
> I am doing the following in a Python IDE called Wing with Python 2.3.
>
> >>> s = "äöü"

From later evidence, this string is encoded as utf-8. Looks like Wing
must be using an implicit "# coding: utf-8" for interactive input ...

> >>> print s
> Ã¤Ã¶Ã¼

... but uses some other encoding for output. Try doing this, and see
what you get:
   import sys
   print sys.stdout.encoding

> >>> print s
> Ã¤Ã¶Ã¼
> >>> s
> '\xc3\xa4\xc3\xb6\xc3\xbc'

Yup, looks like utf-8 ...

> >>> s.decode('utf-8')
> u'\xe4\xf6\xfc'

Yup, decodes from utf-8 without error

> >>> u = s.decode('utf-8')
> >>> u
> u'\xe4\xf6\xfc'

and those Unicode characters actually look like what you started with:

| >>> import unicodedata as ucd
| >>> [ucd.name(x) for x in u'\xe4\xf6\xfc']
| ['LATIN SMALL LETTER A WITH DIAERESIS',
|  'LATIN SMALL LETTER O WITH DIAERESIS',
|  'LATIN SMALL LETTER U WITH DIAERESIS']
| >>>

So, 3 yups, it must be utf-8.
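
For anyone following along on a current Python, here is a rough sketch of
the same three checks, plus what a cp1252 display does with those bytes.
It assumes Python 3, where the session's str becomes a bytes literal and
its unicode becomes str; the variable names are made up for illustration.

```python
import unicodedata

# The bytes shown by the Wing session, written as a Python 3 bytes literal.
raw = b"\xc3\xa4\xc3\xb6\xc3\xbc"

# Checks 1 and 2: the bytes decode as UTF-8 without raising an error.
text = raw.decode("utf-8")

# Check 3: the decoded characters are the ones that were typed.
names = [unicodedata.name(ch) for ch in text]

# A cp1252 display treats each byte as one character, so every
# two-byte UTF-8 sequence shows up as two unrelated glyphs.
garbled = raw.decode("cp1252")

print(names)
print(len(text), len(garbled))
```

Three characters go in and six glyphs come out, which is the garbled
"print s" output quoted above.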

> >>> print u.encode('utf-8')
> Ã¤Ã¶Ã¼
> >>> print u.encode('latin1')
> äöü
>
> Why can't I get äöü printed from utf-8 and I can from latin1?

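Because str objects are just strings of anonymous bytes. They don't
have an attribute that says what encoding their creator had in mind.
Consequently an output channel like stdout has to interpret every byte
it receives using a single encoding of its own. On Windows, in a GUI,
this encoding depends on your locale, and in your case is probably
cp1252. Your utf-8 bytes are therefore rendered one byte at a time as
cp1252, giving two glyphs per character, whereas the latin1 bytes for
äöü mean the same thing in cp1252 and so display correctly. cp1252 is
very similar to latin1 but has extra symbols in it. Try repeating the above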
exercise, but this time include a trademark symbol in your s string,
and add
    print u.encode("cp1252")
at the end of the exercise.
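
A minimal sketch of that exercise, assuming Python 3 (so the text is a str
and the encoded results are bytes). The names text, as_cp1252 and latin1_ok
are only for illustration.

```python
# a-umlaut, o-umlaut, u-umlaut, then TRADE MARK SIGN (U+2122)
text = "\xe4\xf6\xfc\u2122"

# cp1252 has a slot for the trademark sign (byte 0x99).
as_cp1252 = text.encode("cp1252")

# latin1 only covers U+0000 to U+00FF, so U+2122 cannot be encoded.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(repr(as_cp1252), latin1_ok)
```

The three umlauts come out as the same bytes in both encodings; the
trademark sign is where the two part company.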

>  How
> can I use utf-8 exclusively and be able to print the characters?

print exclusiveutf8.decode('utf-8').encode(whateverittakes)

Why do you want to use utf-8 exclusively? Use it for what?

Basic principle when working with non-ASCII data: decode 8-bit input
into Unicode; process using Unicode-aware software (in Python's case,
the built-in unicode type); if 8-bit output is required, encode your
Unicode data with whatever encoding is required.
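
As a sketch of that principle (Python 3 syntax assumed; transcode is a
hypothetical helper, and upper() merely stands in for whatever real
processing you need):

```python
def transcode(data, source_encoding, target_encoding):
    """Decode bytes to text, process the text, encode it for output."""
    text = data.decode(source_encoding)    # 8-bit input -> Unicode
    text = text.upper()                    # placeholder for real processing
    return text.encode(target_encoding)    # Unicode -> 8-bit output


utf8_input = b"\xc3\xa4\xc3\xb6\xc3\xbc"   # the bytes from the Wing session
latin1_output = transcode(utf8_input, "utf-8", "latin-1")
print(repr(latin1_output))
```

Only the first and last lines of the function know about encodings;
everything in between works on characters, not bytes.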

>
> I also did the same thing on the same machine in a command window...
> ActivePython 2.3.2 Build 230 (ActiveState Corp.) based on
> Python 2.3.2 (#49, Oct 24 2003, 13:37:57) [MSC v.1200 32 bit (Intel)]
> on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> s = "äöü"
> >>> print s
> äöü
> >>> s
> '\x84\x94\x81'
> >>> s.decode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x84 in position 0:
> unexpected code byte
> >>> u = s.decode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x84 in position 0:
> unexpected code byte
> >>>
>
> Why such a difference from the IDE to the command window in what it can
> do

Because the command window is the child of MS-DOS, which was the child
of CP/M, and maintains the ancient traditions (like ctrl-Z being taken
as EOF, for example).

> and the internal representation of the unicode?

Unicode? There's no Unicode involved here. In each case you are sending
a string of bytes (0 <= ordinal <= 255) to an output device, each to be
rendered as a bitmap on the screen. Wing evidently causes the renderer
to reach for the latin1 or cp1252 table; the command window is probably
(in your case) using cp850 (or something similar).

On my box, in a command window:
| >>> import sys
| >>> sys.stdout.encoding
| 'cp850'
| >>> '\x84\x94\x81'.decode('cp850')
| u'\xe4\xf6\xfc'
... which is what you had before.
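
To put the two sessions side by side, here is a small sketch (Python 3
assumed, names illustrative) that encodes the same three characters with
the encodings mentioned in this thread:

```python
text = "\xe4\xf6\xfc"  # a-umlaut, o-umlaut, u-umlaut

# Same characters, three different byte strings.
encoded = {name: text.encode(name) for name in ("utf-8", "latin-1", "cp850")}
for name, data in encoded.items():
    print(name, repr(data))

# The cp850 bytes from the command window are not valid UTF-8,
# which is why s.decode('utf-8') raised an error there.
try:
    encoded["cp850"].decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

print(decodes_as_utf8)
```

The bytes only mean something once you know which table they were
produced with.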

HTH,
John