string processing question

Piet van Oostrum piet at cs.uu.nl
Fri May 1 18:08:55 EDT 2009


>>>>> Kurt Mueller <mu at problemlos.ch> (KM) wrote:

>KM> But from the command line python interprets the code
>KM> as 'latin_1' I presume. That is why I have to convert
>KM> the "ä" with unicode().
>KM> Am I right?

There are a couple of stages:
1. Your terminal emulator interprets your keystrokes, encodes them in a
   sequence of bytes and passes them to the shell. How the characters
   are encoded depends on the encoding used in the terminal emulator. So
   for example when the terminal is set to utf-8, your "ä" is converted
   to two bytes: \xc3 and \xa4.
2. The shell passes these bytes to the python command. 
3. The python interpreter must interpret these bytes with some decoding.
   If you use them in a byte string they are copied as such, so in the
   example above the string "ä" will consist of the 2 bytes '\xc3\xa4'.
   If your terminal encoding had been iso-8859-1, the string would have
   had a single byte '\xe4'. If you use it in a unicode string, the
   Python parser has to convert it to unicode. If there is an encoding
   declaration in the source then that is used. Of course it should be
   the same as the actual encoding used by the shell (or the editor
   when you have a script saved in a file), otherwise you have a
   problem. If there is no encoding declaration in the source, Python
   has to guess. It appears that in Python 2.x the default is
   iso-8859-1 but in Python 3.x it will be utf-8. You should avoid
   making any assumptions about this default.
4. At run time, unicode characters that have to be printed, written to
   a file, passed as file names or arguments to other processes, etc.
   have to be encoded again to a sequence of bytes. In this case Python
   refuses to guess. Also you can't use the same encoding as in step 3,
   because the program can run on a completely different system than
   where it was compiled to byte code. So if the (unicode) string isn't
   ASCII and no encoding is given you get an error. The encoding can be
   given explicitly, or, depending on the context, by
   sys.stdout.encoding, sys.getdefaultencoding() or PYTHONIOENCODING
   (from 2.6 on). Steps 3 and 4 are spelled out by hand in the sketch
   below.
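
A minimal sketch of what steps 3 and 4 do, done by hand, assuming
Python 2.x and a source saved (and declared) as utf-8; the explicit
decode/encode calls are only illustrations:

  # -*- coding: utf-8 -*-
  b = 'ä'                  # byte string: the 2 bytes '\xc3\xa4'
  u = b.decode('utf-8')    # step 3 by hand: 2 bytes -> 1 unicode character
  print len(b), len(u)     # prints: 2 1
  print repr(u.encode('utf-8'))   # step 4 by hand: prints '\xc3\xa4'

Whether you do this explicitly or let Python do it for you, the
byte/unicode conversion always happens somewhere.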

Unfortunately there is no equivalent to PYTHONIOENCODING for the
interpretation of the source text; it only works at run time.
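
You can see the run-time side at work like this (Python 2.6 or later;
stdout is redirected so Python cannot take the encoding from the
terminal, and the u"\xe4" escape is used so that the source encoding
plays no role):

  PYTHONIOENCODING=utf-8 python -c 'print u"\xe4"' > out.txt

Without PYTHONIOENCODING this should fail with a UnicodeEncodeError;
with it, out.txt should contain the two utf-8 bytes \xc3 \xa4.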

Example:
python -c 'print len(u"ä")'
prints 2 on my system, because my terminal is utf-8 so the ä is passed
as 2 bytes (\xc3\xa4), but these are interpreted by Python 2.6.2 as two
iso-8859-1 characters.
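
You can also look at what Python actually constructed; on the same
utf-8 terminal

  python -c 'print repr(u"ä")'

should show u'\xc3\xa4', i.e. the two utf-8 bytes taken as two
iso-8859-1 characters.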

If I do 
python -c 'print u"ä"' in my terminal, I therefore get the two characters Ã¤,
but if I do this in Emacs I get:
UnicodeEncodeError: 'ascii' codec can't encode characters in position
0-1: ordinal not in range(128)
because my Emacs doesn't pass the encoding of its terminal emulation.
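
You can check what encoding Python will use for step 4 with

  python -c 'import sys; print sys.stdout.encoding'

On a normal utf-8 terminal this should print something like UTF-8; in a
setting like the Emacs case above (or with stdout redirected) it will
typically print None or an ascii variant, and Python then falls back to
the 'ascii' codec, which is where the error above comes from.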

However:
python -c '# -*- coding:utf-8 -*-
print len(u"ä")'
will correctly print 1.
-- 
Piet van Oostrum <piet at cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: piet at vanoostrum.org


