UnicodeEncodeError when not running script from IDE

Steven D'Aprano steve+comp.lang.python at pearwood.info
Tue Feb 12 19:21:45 EST 2013


Magnus Pettersson wrote:


> # This made the fetching of the website work. Why did i have to write
> # url.encode("UTF-8") when url already is unicode? I feel i dont have a
> # good understanding of this.
> page = urllib2.urlopen(url.encode("UTF-8"))


Start here:

"The Absolute Minimum Every Software Developer Absolutely, Positively Must
Know About Unicode and Character Sets (No Excuses!)"

http://www.joelonsoftware.com/articles/Unicode.html


Basically, Unicode is an in-memory data format. Python knows about Unicode
characters (to be technical: code points), but files on disk do not.
Neither do network protocols, or terminals, or other simple devices. They
only understand bytes.
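
To make the code point versus byte distinction concrete, here is a small
sketch. It assumes Python 3, where str is the Unicode type (in the Python 2
of the quoted code, that role is played by unicode). The sample string is
made up for illustration.

```python
# A str is a sequence of code points; bytes is what files and sockets carry.
text = "na\u00efve \u20ac5"   # 'naive' with a diaeresis, then a euro sign and 5

# One integer per character: the Unicode code points.
code_points = [ord(ch) for ch in text]

# The same text turned into bytes using UTF-8.
utf8_bytes = text.encode("utf-8")

print(len(text))          # number of code points
print(len(utf8_bytes))    # number of bytes, larger for non-ASCII text
print(code_points)
print(list(utf8_bytes))
```

The two lengths differ because some code points need more than one byte
once encoded.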

So when you have Unicode text, and you want to write it to a file on disk,
or print it, or send it over the network to another machine, it has to be
*encoded* into bytes, and then *decoded* back into Unicode when you read it
from the file again. Sometimes the system will "helpfully" do that encoding
and decoding automatically for you, which is fine when it works, but
perplexing when it doesn't.
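
A rough sketch of both cases, again assuming Python 3. The ASCII-only
stream here is only a stand-in for a terminal or pipe with a limited
encoding; it is not taken from the original script.

```python
import io

text = "Gr\u00fc\u00dfe"   # German greeting with two non-ASCII letters

# Explicit: encode to bytes before writing, decode after reading back.
raw = text.encode("utf-8")
round_tripped = raw.decode("utf-8")

# Decoding with the wrong encoding does not fail here, it just gives garbage.
mojibake = raw.decode("latin-1")

# Implicit: a text stream encodes for you, using whatever encoding it has.
ascii_stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
try:
    ascii_stream.write(text)
    ascii_stream.flush()
    implicit_ok = True
except UnicodeEncodeError as err:
    implicit_ok = False
    print("implicit encoding failed:", err)

print(round_tripped == text, mojibake == text, implicit_ok)
```

The same text that round-trips cleanly through UTF-8 blows up as soon as
something tries to encode it as ASCII on your behalf.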

There are many, many, many different *encoding schemes*. ASCII is one. UTF-8
is another. And then there are about a bazillion legacy encodings which, if
you are lucky, you will never need to care about. Only some encodings can
deal with the entire range of Unicode characters; most can only deal with a
(typically small) subset of possible characters. E.g. ASCII only knows
about 128 characters out of the million-plus that Unicode deals with.
Latin-1 can handle at most 256 different characters. If you have a say in
the matter, always use UTF-8, since it can handle the full set of Unicode
characters and is compact for text that is mostly ASCII.
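
You can check the coverage claim yourself with something like the
following, assuming Python 3. The helper can_encode and the sample strings
are invented for this illustration.

```python
# Sample strings chosen to fall inside or outside each encoding's range.
samples = {
    "ascii only": "hello",
    "latin-1 range": "caf\u00e9",
    "beyond latin-1": "\u20ac and \u4e2d",
}

def can_encode(text, encoding):
    """Return True if every character in text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

results = {}
for label, text in samples.items():
    for enc in ("ascii", "latin-1", "utf-8"):
        results[(label, enc)] = can_encode(text, enc)
        print(label, enc, results[(label, enc)])
```

Only UTF-8 copes with all three samples, and pure ASCII text costs one
byte per character in UTF-8, the same as in ASCII itself.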


-- 
Steven



