Unicode/ascii encoding nightmare

Mon Nov 6 15:49:13 EST 2006

Thomas W wrote:
> I'm getting really annoyed with python in regards to
> unicode/ascii-encoding problems.
>
> The string below is the encoding of the norwegian word "fødselsdag".
>
> >>> s = 'f\xc3\x83\xc2\xb8dselsdag'

There is no such thing as "*the* encoding" of any given string.

>
> I stored the string as "fødselsdag" but somewhere in my code it got
> translated into the mess above and I cannot get the original string
> back.

Somewhere in your code??? Can't you track through your code to see
where it is being changed? Failing that, can't you show us your code so
that we can help you?

I have guessed *what* you got, but *how* you got it boggles the mind:

The effect is the same as (decode from latin1 to Unicode, encode as
utf8) *TWICE*. That's how you change one byte in the original to *FOUR*
bytes in the "mess":

| >>> orig = 'f\xf8dselsdag'
| >>> orig.decode('latin1').encode('utf8')
| 'f\xc3\xb8dselsdag'
| >>> orig.decode('latin1').encode('utf8').decode('latin1').encode('utf8')
| 'f\xc3\x83\xc2\xb8dselsdag'
| >>>
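If that guess is right, the damage can be undone by running one round of the conversion backwards. The following is only a sketch under that assumption (two latin1-to-utf8 rounds); the names mess, once and text are mine, and the b'' and u'' prefixes are there so it runs unchanged on Python 2.6+ and Python 3:

```python
# The byte string from the original post (assumed to be double-encoded UTF-8).
mess = b'f\xc3\x83\xc2\xb8dselsdag'

# Undo one round: read the bytes as UTF-8, then turn each resulting
# character back into a single latin1 byte.
once = mess.decode('utf8').encode('latin1')

# What remains is ordinary UTF-8, so decoding it yields the text itself.
text = once.decode('utf8')

print(repr(once))
print(repr(text))
```

This recovers the string only; it does not tell you where the extra conversion happens in your code, which is still the thing to fix.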

> It cannot be printed in the console or written to a plain text file.

Incorrect. *Any* string can be printed on the console or written to a
file. What you mean is that when you look at the output, it is not what
you want.
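To illustrate: the bytes go out exactly as they are, and what you see depends on which encoding the viewer assumes. A small sketch, using io.BytesIO as a stand-in for a file opened in binary mode (that substitution and the variable names are my own):

```python
import io

mess = b'f\xc3\x83\xc2\xb8dselsdag'

# Writing the byte string succeeds and nothing is altered on the way out.
buf = io.BytesIO()  # stands in for open(path, 'wb')
buf.write(mess)
written = buf.getvalue()

# The same 13 bytes look different depending on how a viewer decodes them.
as_utf8 = mess.decode('utf8')      # 11 characters
as_latin1 = mess.decode('latin1')  # 13 characters, one per byte

print(written == mess)
print(len(as_utf8), len(as_latin1))
```

Neither view shows "fødselsdag", because the bytes themselves are already wrong before they are written.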

> I've tried to convert it using
>
> >>> s.encode('iso-8859-1')
> Traceback (most recent call last):
>   File "<interactive input>", line 1, in ?
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
> ordinal not in range(128)

encode is meant for unicode objects: it turns text into bytes. If it is
applied to a str object, Python first converts the str to unicode using
the default codec (typically ascii), and only then encodes the result.

s.encode('iso-8859-1') is effectively
s.decode('ascii').encode('iso-8859-1'), and s.decode('ascii') fails for
the (obvious(?)) reason given.
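The hidden step can be made visible by doing the decode explicitly. This is a sketch rather than a transcript of your session; it assumes Python 2.6+ or Python 3 (for the "except ... as" syntax), and position, u and latin1_bytes are names I made up:

```python
s = b'f\xc3\x83\xc2\xb8dselsdag'

# This is the step that s.encode(...) performs silently on a Python 2 str.
try:
    s.decode('ascii')
    position = None
except UnicodeDecodeError as exc:
    position = exc.start  # index of the first non-ascii byte
    print(exc)

# Naming the source codec explicitly avoids the implicit ascii decode.
u = s.decode('utf8')
latin1_bytes = u.encode('iso-8859-1')
print(repr(latin1_bytes))
```

Note that the explicit version does not raise, but it only peels off one layer of the double conversion.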

>
> >>> s.encode('utf-8')
> Traceback (most recent call last):
>   File "<interactive input>", line 1, in ?
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
> ordinal not in range(128)

Same story as for 'iso-8859-1'

>
> And nothing helps. I cannot remember having these problems in earlier
> versions of python

I would be very surprised if you couldn't reproduce your problem on any
2.n version of Python.

> and it's really annoying, even if it's my own fault
> somehow, handling of normal characters like this shouldn't cause this
> much hassle. Searching google for "codec can't decode byte" and
> UnicodeDecodeError etc. produces a bunch of hits so it's obvious I'm
> not alone.
>
> Any hints?

1. Read the Unicode howto: http://www.amk.ca/python/howto/unicode
2. Read the Python documentation on .decode() and .encode() carefully.
3. Show us your code so that we can help you avoid the double
conversion to utf8. Tell us what IDE you are using.
4. Tell us what you are trying to achieve. Note that if all you are
trying to do is read and write text in Norwegian (or any other language
that's representable in iso-8859-1 aka latin1), then you don't have to
do anything special at all in your code: this is the good old "legacy"
way of doing things, universally in vogue before Unicode was invented!

HTH,
John