[Baypiggies] Handling unwanted Unicode \u2019 characters in XML

Stephen McInerney spmcinerney at hotmail.com
Wed Jul 2 01:24:19 CEST 2008


Hi Chris,

> Are you really sure you need this to be ASCII and not UTF-8? If so,
> why do you need it to be true ASCII?

I want it to be ASCII so I can print it and do regex matching.
Unless I need to move with the times and start doing Unicode regexes by default.
But I'm using Python 2.5.2, so I'd really prefer to keep everything in ASCII-land.
It's a pain when you're debugging and print keeps throwing exceptions.
And in this case, the apostrophe was not Unicode to start with.
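
One way to stay in ASCII-land might look like the sketch below: map the typographic quotes back to their plain ASCII equivalents before encoding, so that print and ordinary regexes work on the result. The table PUNCT_TO_ASCII and the helper to_ascii are hypothetical names made up for illustration, the table only covers the four curly quotes, and the code is written to run on Python 3 (on 2.5 the same translate/encode calls exist on unicode objects, but that is untested here).

```python
import re

# Hypothetical mapping (not from the thread): code point -> ASCII replacement.
PUNCT_TO_ASCII = {
    0x2018: u"'",   # left single quotation mark
    0x2019: u"'",   # right single quotation mark, the \u2019 in question
    0x201C: u'"',   # left double quotation mark
    0x201D: u'"',   # right double quotation mark
}

def to_ascii(text):
    """Replace known typographic quotes, then force the rest into ASCII.

    Any other non-ASCII character becomes an XML character reference
    (for example &#233;) instead of raising UnicodeEncodeError.
    """
    replaced = text.translate(PUNCT_TO_ASCII)
    return replaced.encode("ascii", "xmlcharrefreplace").decode("ascii")

cleaned = to_ascii(u"It\u2019s a caf\u00e9")
print(cleaned)

# A plain ASCII regex now matches the apostrophe.
match = re.search(r"It's", cleaned)
print(match is not None)
```

The xmlcharrefreplace error handler is a design choice suited to XML output; "replace" or "ignore" could be swapped in if character references are not wanted.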

> > But the ASCII encoding of \u2019 is not very human-readable or useful:
> > >>> u'\u2019'.encode('utf-8')
> > '\xe2\x80\x99'
> 
> That's UTF-8, not ASCII (there's a big difference), and you're seeing
> the repr() of the encoded string, which is of course an ugly escape
> sequence.
> If instead you print the encoded string, you get:
> 
> >>> print u'\u2019'.encode('utf-8')
> ’

I don't get that. I get 'â' instead (does it depend on the C locale settings? If so, that's not very satisfactory at all):
>>> print u'\u2019'.encode('utf-8')
â
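
A likely explanation, sketched below under the assumption that the terminal is treating output as a single-byte encoding such as Latin-1 or cp1252: the three UTF-8 bytes of \u2019 are each shown as a separate character, and only the first one (â) is visible. The variable names are illustrative, and the code is written to run on Python 3; it does not prove what this particular terminal is doing.

```python
encoded = u"\u2019".encode("utf-8")      # three bytes: E2 80 99

# repr() shows the escaped bytes, which is what the interactive prompt echoes.
print(repr(encoded))

# A terminal assuming a single-byte encoding maps each byte to one character.
as_latin1 = encoded.decode("latin-1")    # a-circumflex plus two C1 control characters
as_cp1252 = encoded.decode("cp1252")     # a-circumflex, euro sign, trademark sign

# Only decoding as UTF-8 recovers the original single character.
roundtrip = encoded.decode("utf-8")

print(len(encoded), len(as_latin1), len(roundtrip))
```

If this is what is happening, the fix would be on the display side (a UTF-8 terminal, or encoding to whatever the terminal actually expects) rather than in the string itself.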

Thanks,
Stephen
