Totally confused by Python's string thing.

Mon Dec 16 12:48:44 EST 2002

On Mon, 16 Dec 2002 17:34:30 +0100 (MET), Doru-Catalin Togea
<doru-cat at ifi.uio.no> wrote:

>Hi!
>
>I am doing basic string manipulation with ActivePython 2.2 on Win2000 Pro.
...
>crashes when 'str' contains norwegian letters (åøæÅØÆ), with the following
>error message:
>...
>File ... , line 53, in doEncode
>   strCopy = str.encode('latin-1')
>UnicodeError: ASCII decoding error: ordinal not in range(128)
>
>Can you help me understand how python deals with strings?
....

I know what you're feeling; I've been there before. I struggled with
it and think I have finally grasped it. The detailed explanation, in
German, is here:

http://p-nand-q.com/e/unicode-faq-de.html

I don't have time for it, but I figure it would not be a bad idea to
translate it to, say, English. Any volunteers?

If my understanding is correct, you are converting in the wrong
direction. You can see what happens if you type this in Pythonwin:

-------------------
>>> "ö".decode("latin-1")
u'\xf6'
>>> "ö".encode("latin-1")
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
UnicodeError: ASCII decoding error: ordinal not in range(128)
-------------------

Here is an explanation from the docs:

--------------------
encode takes a unicode string and returns an 8-bit string "containing a
portion (perhaps all) of the Unicode string converted into the given
encoding".

decode is the opposite of encode, taking an 8-bit string and
returning a unicode string.
--------------------

So, what happens when you do 

>>> "ö".decode("latin-1")
u'\xf6'

is this: "ö" is treated as an 8-bit string in latin-1 and decoded from
latin-1 into a unicode string. Input: 8-bit string, output: unicode.
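
A minimal sketch of that direction, written for Python 3 rather than
the Python 2.2 discussed in this thread (that switch is an assumption
made so the example runs today). The byte 0xF6 is spelled out
explicitly so nothing depends on how the source file is saved; the
names raw and text are just chosen for the example.

```python
# The single byte 0xF6 is "ö" in latin-1.
raw = b"\xf6"

# decode: 8-bit string in, unicode string out.
text = raw.decode("latin-1")

print(ascii(text))      # prints '\xf6'
print(hex(ord(text)))   # prints 0xf6
```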

On the other hand, this statement:

>>> "ö".encode("latin-1")

first converts "ö" to a unicode string, and then encodes it in
latin-1. (I agree, I find the meanings of "encode" and "decode" a bit
unintuitive, too. But after reading a lot about this stuff, I manage
;) Anyway,

>>> "ö".encode("latin-1")

fails because "ö" cannot be converted to unicode: the system's
default encoding is 7-bit ASCII, which has no "ö".
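
The hidden first step can be written out by hand. This is only a
sketch in Python 3 (an assumption; in Python 2.2 the ASCII step
happens silently inside encode), and the names raw, error and text are
made up for the example.

```python
raw = b"\xf6"  # "ö" as a latin-1 byte

# The implicit step of the failing call: decode the bytes as ASCII.
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc
    print("ascii decode failed:", exc.reason)

# Naming the real encoding for the decode makes the round trip work.
text = raw.decode("latin-1")
assert text.encode("latin-1") == raw
```

The point of the second half is that encode only makes sense once you
already hold a unicode string, which you get by decoding with the
encoding the bytes are really in.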

SOLUTIONS:

Well, probably first tell us more about the problem. Do you have
unicode input? Do you have unicode output? (Because if you don't,
chances are you needn't be concerned about unicode at all).