Unicode

Mon Dec 17 17:02:18 EST 2012

On 17/12/12 22:09:04, Dave Angel wrote:
> print src.decode("utf-8").encode("latin-1", "ignore")
> 
> That says to decode it using utf-8 (because the html declared a utf-8
> encoding), and encode it back to latin-1 (because your terminal is stuck
> there), then print.
> 
> 
> Just realize that once you start using 'ignore' you're going to also
> ignore discrepancies that are real. For example, maybe your terminal is
> actually something other than either latin-1 or utf-8.

If you need to see such discrepancies, you can do

print src.decode("utf-8").encode("latin-1", "xmlcharrefreplace")

That would produce something like:

processeurs Intel® Core&#8482; de 3ème génération av

that is, the problem characters are displayed in &#...; notation.
That is ugly, but sometimes it's the only way to see what character
you really have.
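
To make the difference between the two error handlers concrete, here is a
rough sketch. The sample string is made up for illustration (it is not the
original page source), and it is written so that it also runs on Python 3,
where encode() returns a bytes object:

```python
# Sketch only: `sample` is a hypothetical string, not the poster's data.
# U+00AE (registered sign) exists in latin-1; U+2122 (trade mark) does not.
sample = u"Intel\u00ae Core\u2122"

# "ignore" silently drops whatever latin-1 cannot represent.
ignored = sample.encode("latin-1", "ignore")

# "xmlcharrefreplace" substitutes a decimal &#...; reference instead.
replaced = sample.encode("latin-1", "xmlcharrefreplace")

print(repr(ignored))
print(repr(replaced))
```

With "ignore" the trade mark sign simply vanishes; with "xmlcharrefreplace"
it shows up as &#8482;, so you can still tell which character was there.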

Notice that the number you get is in decimal, whereas the \u....
notation uses hex:

>>> ord(u"\u2122")
8482
>>>
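
If you want to go the other way and turn the decimal references back into
\u.... form, something along these lines should do. This is only a sketch:
charref_to_escape is a name I made up, and it assumes the references are
the plain decimal kind that "xmlcharrefreplace" produces:

```python
import re

def charref_to_escape(text):
    # Hypothetical helper: rewrite decimal &#NNNN; references as
    # \uXXXX (or \UXXXXXXXX for code points above U+FFFF).
    def repl(match):
        code = int(match.group(1))
        if code > 0xFFFF:
            return "\\U%08x" % code
        return "\\u%04x" % code
    return re.sub(r"&#(\d+);", repl, text)

escaped = charref_to_escape("Core&#8482;")
print(escaped)
```

The conversion itself is nothing more than decimal to hex: 8482 is 0x2122.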

Hope this helps,

-- HansM