Python and decimal character entities over 128.

Marc 'BlackJack' Rintsch bj_666 at gmx.net
Thu Jul 10 01:07:21 EDT 2008


On Wed, 09 Jul 2008 16:39:24 -0700, bsagert wrote:

> Some web feeds use decimal character entities that seem to confuse
> Python (or me).

I guess they confuse you.  Python is fine.

> For example, the string "doesn't" may be coded as "doesn&#8217;t" which
> should produce a right leaning apostrophe. Python hates decimal entities
> beyond 128 so it chokes unless you do something like
> string.encode('utf-8').

Python neither hates nor chokes on these entities.  It just refuses to
guess which encoding you want when you try to write *unicode* objects
into a file.  Files contain byte values, not characters.
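To illustrate the point about bytes versus characters, here is a minimal
sketch. It assumes Python 3 (where `str` plays the role of the old
*unicode* type), and the feed fragment, the helper names and the file
name are made up for the example, not taken from the original script.

```python
import html
import os
import tempfile


def decode_entities(fragment):
    """Turn character references such as &#8217; into real characters."""
    return html.unescape(fragment)


def save_text(path, text, encoding="utf-8"):
    """Write text to a file, stating the encoding explicitly."""
    with open(path, "w", encoding=encoding) as handle:
        handle.write(text)


# Hypothetical feed fragment with a decimal entity above 128.
title = decode_entities("Apple doesn&#8217;t care")

# The entity is now a single character, U+2019.
print(repr(title))

# Hypothetical output file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "feed.txt")
save_text(path, title)

# On disk the apostrophe is three UTF-8 bytes: E2 80 99.
with open(path, "rb") as handle:
    print(handle.read())
```

Because the encoding is named when the file is opened, Python does not
have to guess, and nothing chokes.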

> Even then, what should have been a right-leaning apostrophe ends up as
> "’". The following script does just that. Look for the string "The
> Canuck iPhone: Apple doesn’t care" after running it.

Then you didn't tell the application you used to look at the result that
the text is UTF-8 encoded.  I guess you are using Windows and the
application expects cp1252 encoded text, because a UTF-8 encoded
apostrophe looks like '’' in cp1252.
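The effect can be reproduced in a few lines. This is only a sketch in
Python 3, and the variable names are mine: encode the apostrophe as
UTF-8, then decode those same bytes as if they were cp1252.

```python
apostrophe = "\u2019"                      # RIGHT SINGLE QUOTATION MARK

# What ends up in the file when writing UTF-8: three bytes.
utf8_bytes = apostrophe.encode("utf-8")    # E2 80 99

# What a viewer shows if it assumes cp1252 instead of UTF-8:
# each byte becomes one character.
misread = utf8_bytes.decode("cp1252")
print(misread)

# The bytes themselves are fine; reading them with the right
# encoding gives the apostrophe back.
repaired = misread.encode("cp1252").decode("utf-8")
print(repaired == apostrophe)
```

So the data is correct; only the interpretation by the viewing
application is wrong.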

Choose the encoding you want the result to have and everything is fine,
unless you stumble over a feed using characters which can't be encoded
in the encoding of your choice.  That's why UTF-8 might have been a good
idea.
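A small sketch of that caveat, again in Python 3 and with a helper name
(`can_encode`) invented for the example: some encodings simply have no
byte sequence for a given character, while UTF-8 covers all of them.

```python
def can_encode(text, encoding):
    """Return True if every character in text exists in the encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


apostrophe = "\u2019"   # the right-leaning apostrophe from the feed
cjk = "\u4e2d"          # a CJK character, as an arbitrary example

for encoding in ("latin-1", "cp1252", "utf-8"):
    print(encoding, can_encode(apostrophe, encoding),
          can_encode(cjk, encoding))
```

cp1252 happens to contain the apostrophe but not the CJK character,
latin-1 contains neither, and UTF-8 can encode both.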

Ciao,
	Marc 'BlackJack' Rintsch


