'ascii' codec can't encode character u'\xf3'

John Roth newsgroups at jhrothjr.com
Tue Aug 17 12:51:31 EDT 2004


"Martin Slouf" <xslom03 at vse.cz> wrote in message
news:mailman.1775.1092723467.5135.python-list at python.org...
> i had similar errors:
>
> Traceback (most recent call last):
>   File "/home/martin/skripty/accounts.py", line 125, in ?
>     main(sys.argv)
>   File "/home/martin/skripty/accounts.py", line 119, in main
>     print_accounts(accounts, url_part)
>   File "/home/martin/skripty/accounts.py", line 94, in print_accounts
>     print str(i).encode("utf-8", "replace")
> UnicodeEncodeError: 'ascii' codec can't encode characters in position
> 151-152: ordinal not in range(128)
>
> - - - -
>
> the solution seems to be:
>
> 0. string is not in unicode encoding (assumption)
> 1. before printing out, convert the string to unicode
> 2. when printing, convert to whatever charset you like
>
> though i dont understand much why (ive solved it a minute ago :) the
> code should be:
>
> str = "any nonunicode string"
> print unicode(str).encode("iso-8859-2", "replace")

I think the terminology is backwards. If you use a unicode string
(that is, u"foo"), that string is in unicode; that's how Python
stores unicode strings. However, it can't be read or written as
such: it has to be decoded from something else (utf-8, iso-8859-2,
whatever) after being read, and encoded to something (utf-8,
iso-8859-1, whatever) to be written.
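
To make the decode/encode direction concrete, here is a minimal sketch.
The byte value and the choice of iso-8859-2 as the source encoding are
assumptions for illustration; it is written so that it should run on
both Python 2.6+ and Python 3.

```python
# Bytes as they might arrive from a file or socket. We ASSUME they are
# iso-8859-2; Python cannot work that out from the bytes themselves.
raw = b"\xf3"

# Decode: byte string -> unicode string.
text = raw.decode("iso-8859-2")

# Encode: unicode string -> byte string, in whatever encoding the
# output needs. The same character gives different bytes per encoding.
as_utf8 = text.encode("utf-8")
as_latin2 = text.encode("iso-8859-2")
```

Only `text` is a unicode string here; `raw`, `as_utf8` and `as_latin2`
are all encoded byte strings.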

A string on disk isn't in "unicode"; it's always in some
encoded format: often utf-8, or some single-byte format
such as iso-8859-1, or a Far Eastern multi-byte format.
A string only winds up in unicode when it's comfortably
ensconced in a unicode string.

> comments:
>
> 1. why the string is not in unicode can have several reasons -- i guess:
> - does ogg stores tags in unicode?
> - you have parsed an xml file with encoding attribute set (that
> is what i do)
> - etc
>
> 2. "replace" parameter in encode causes non-printable chars to be
> replaced with '?' (you can use "ignore" or strict", see your python
> doc)
>
> 3. the above will work _only_ _if_ the 'str' encoding is "iso-8859-2" --
> a funny thing -- first line of code converts from unknown (but the
> programmer must know it) to unicode and the second one converts it back
> from unicode to unknown (now the programmer tells that secret to python
> :)

Well, the encoding declaration tells Python what to do with unicode
string literals that it finds in the Python source. It doesn't do anything
else.

> 4. i would like to know from any python expert whether/why/why not:
>
> * my assumptions are right

As I said above, the terminology is backwards. "Pure"
unicode only exists in unicode strings. Everything else
is some encoded character set or other in regular single
byte strings, ***including unicode encoded as utf-8.***
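
A small sketch of that last point, assuming the character from the
subject line (u'\xf3') as sample data: the utf-8 form is an ordinary
byte string with a different length from the unicode string, and the
ascii codec has no way to represent the character at all.

```python
text = u"\xf3"                       # one unicode character

utf8_bytes = text.encode("utf-8")    # two bytes, not "unicode" any more
latin1_bytes = text.encode("iso-8859-1")  # one byte

# The ascii codec only covers ordinals 0-127, so a strict encode fails.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# With "replace", the unencodable character becomes "?" instead.
replaced = text.encode("ascii", "replace")
```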

> * why is that behaviour? -- if you search google you get
> thousands of errors like this -- with no proper solutions i must add

There's a lot of confusion out there. Lots of people are under
the impression that the encoding declaration somehow does
something magical with unicode, when all (and I need to
emphasize that, ALL) it does is tell Python which encoding to
use when decoding the contents of unicode literals in the source.
Everything outside of unicode literals is treated as a stream
of 8-bit bytes, regardless of the programmer's intentions.

Before the encoding declaration, if you wanted to
include unicode characters in your program you had
to use an editor that encoded in utf-8 and put them
in single byte strings, and then decode those strings
into unicode strings. This was fairly error-prone since
you could drop utf-8 encoded characters somewhere
they didn't belong, causing bugs that were very difficult
to find.

> * is there an easier portable way (no sitecustomize.py changes)
> to do it

The best thing is to ignore the encoding declaration and
write the program as if it wasn't there. On input you need
to somehow determine the encoding of the data and then
decode that into a unicode string; on output you need
to do the reverse and encode the unicode string into a
single byte string before writing it.
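
As a rough sketch of that pattern: decode once at the input boundary,
work on unicode in the middle, encode once at the output boundary. The
helper names, the sample bytes and the two encodings below are all
hypothetical; in a real program the input encoding has to come from
somewhere you trust (an XML declaration, a protocol header, a
convention).

```python
def to_unicode(raw_bytes, source_encoding):
    """Input boundary: byte string -> unicode string."""
    return raw_bytes.decode(source_encoding)


def to_bytes(text, target_encoding):
    """Output boundary: unicode string -> byte string.

    "replace" substitutes "?" for anything the target can't represent.
    """
    return text.encode(target_encoding, "replace")


incoming = b"Slouf \xf3"                 # assumed to be iso-8859-2
text = to_unicode(incoming, "iso-8859-2")
processed = text.upper()                 # all real work happens on unicode
outgoing = to_bytes(processed, "utf-8")  # encode only when writing
```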

You can simplify some of this by using the open
function in the codecs module. That lets you
declare the encoding on open so that the
encoding and decoding happens transparently.
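
Something along these lines, as a sketch only: the file name, the
sample text and the choice of utf-8 are assumptions, and a temporary
directory is used so the example doesn't touch real files. (On
Python 3 the built-in open() accepts an encoding argument as well.)

```python
import codecs
import os
import tempfile

text = u"\xf3 and friends"   # unicode string held in memory

# Hypothetical output file in a scratch directory.
path = os.path.join(tempfile.mkdtemp(), "accounts.txt")

# Writing: unicode goes in, utf-8 bytes land on disk.
out = codecs.open(path, "w", encoding="utf-8")
try:
    out.write(text)
finally:
    out.close()

# Reading: utf-8 bytes come off disk, unicode comes out.
inp = codecs.open(path, "r", encoding="utf-8")
try:
    round_tripped = inp.read()
finally:
    inp.close()

# For comparison, the raw bytes that are actually in the file.
raw_file = open(path, "rb")
try:
    on_disk = raw_file.read()
finally:
    raw_file.close()
```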

> * i was looking in site.py and there is deleted the
> sys.setdefaultencoding() function, but from the comments i do
> not know why -- you know it? why is user not allowed to change the
> default encoding? it seems reasonable to me if he/she could do that.

That's someone else's answer. I'm not going to get into
the politics behind that, other than to say that there are
very serious release to release compatibility considerations
here.

John Roth

>
> thx.
>
> m.
>




