Python and decimal character entities over 128.
Marc 'BlackJack' Rintsch
bj_666 at gmx.net
Thu Jul 10 01:07:21 EDT 2008
On Wed, 09 Jul 2008 16:39:24 -0700, bsagert wrote:
> Some web feeds use decimal character entities that seem to confuse
> Python (or me).
I guess they confuse you. Python is fine.
> For example, the string "doesn't" may be coded as "doesn&#8217;t" which
> should produce a right leaning apostrophe. Python hates decimal entities
> beyond 128 so it chokes unless you do something like
> string.encode('utf-8').
Python neither hates nor chokes on these entities. It just refuses to
guess which encoding you want when you try to write *unicode* objects into
a file. Files contain byte values, not characters.
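To make the point about bytes versus characters concrete, here is a minimal sketch. It assumes Python 3 (this thread dates from the Python 2 era, where the spelling differs) and uses a hypothetical temporary file, feed.txt, that is not part of the original script.

```python
import html
import os
import tempfile

# Decode the decimal entity into a real character (U+2019).
text = html.unescape("doesn&#8217;t")

# Hypothetical output file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "feed.txt")

# Name the encoding explicitly so nothing has to guess how the
# characters become bytes.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading the file back in binary mode shows what is really stored:
# the apostrophe occupies three bytes in UTF-8.
with open(path, "rb") as f:
    raw = f.read()

print(repr(text))
print(raw)
```

The text holds a single apostrophe character, while the file holds whatever bytes the chosen encoding produces for it.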
> Even then, what should have been a right-leaning apostrophe ends up as
> "â€™". The following script does just that. Look for the string "The
> Canuck iPhone: Apple doesnâ€™t care" after running it.
Then you didn't tell the application you used to look at the result that
the text is UTF-8 encoded. I guess you are using Windows and
the application expects cp1252 encoded text, because a UTF-8 encoded
apostrophe looks like 'â€™' in cp1252.
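The effect can be reproduced in a few lines. This is only a sketch assuming Python 3; the variable names are made up for illustration.

```python
# The right single quotation mark, U+2019 (decimal 8217).
apostrophe = "\u2019"

# Encoded as UTF-8 it becomes three bytes: E2 80 99.
utf8_bytes = apostrophe.encode("utf-8")

# A viewer that assumes cp1252 maps each byte to its own character.
misread = utf8_bytes.decode("cp1252")

# A viewer that is told the text is UTF-8 recovers the apostrophe.
correct = utf8_bytes.decode("utf-8")

print(utf8_bytes)
print(misread)
print(correct == apostrophe)
```

The bytes are identical in both cases; only the encoding assumed by the reader differs.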
Choose the encoding you want the result to have and everything is fine,
unless you stumble over a feed using characters which can't be encoded
in the encoding of your choice. That's why UTF-8 might have been a good
idea.
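As a rough illustration of that caveat, the sketch below (Python 3 assumed; latin-1 is picked here only as an example of a limited encoding) shows an encoding that cannot represent the apostrophe, one possible fallback, and UTF-8 handling it directly.

```python
text = "doesn\u2019t"

# latin-1 has no slot for U+2019, so a strict encode fails.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# One fallback: replace unencodable characters with decimal
# character references, the same form the feed used.
fallback = text.encode("latin-1", errors="xmlcharrefreplace")

# UTF-8 can encode every Unicode character, so no fallback is needed.
utf8 = text.encode("utf-8")

print(latin1_ok)
print(fallback)
print(utf8)
```

Whether a fallback like xmlcharrefreplace is acceptable depends on what will read the output; it suits HTML or XML but not plain text.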
Ciao,
Marc 'BlackJack' Rintsch