[Baypiggies] Handling unwanted Unicode \u2019 characters in XML

Wed Jul 2 00:36:29 CEST 2008

Here's one for the XML people,

I am using XML imported from FrameMaker, which contains the unwanted Unicode
character '\u2019', RIGHT SINGLE QUOTATION MARK (the character started out as a plain apostrophe in the source Frame document).
This seems to be a common issue with many word processors (MS, Frame, etc.),
which use the funky left- and right-leaning apostrophes. I see many references to this issue on the web.

You can't always print Unicode strings as is: when the output encoding is ASCII, printing raises a UnicodeEncodeError, so you must encode them explicitly first.
But \u2019 has no ASCII encoding at all, and its UTF-8 encoding is not very human-readable or useful:
>>> u'\u2019'.encode('utf-8')
'\xe2\x80\x99'
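
For comparison, here is a minimal sketch of how the same character behaves under the different encodings and error handlers. It is written in Python 3 syntax as an assumption (in Python 2 the literal would be u'\u2019'), and the variable names are only illustrative:

```python
# Python 3 syntax; in Python 2 the literal would be written u'\u2019'.
s = '\u2019'  # RIGHT SINGLE QUOTATION MARK

# UTF-8 can represent the character, using three bytes.
utf8_bytes = s.encode('utf-8')

# Strict ASCII encoding cannot represent it and raises an exception.
try:
    s.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# The 'replace' handler substitutes '?', and 'ignore' drops the character.
replaced = s.encode('ascii', 'replace')
ignored = s.encode('ascii', 'ignore')
```

Neither error handler gives back the apostrophe, which is why some explicit mapping step is still needed.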

Hence I thought I should do a find-and-replace with a regex to map the unwanted \u2019 back to a plain old apostrophe.
(You can do Unicode regexes with re.compile(<pattern>, re.UNICODE).)
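
A rough sketch of that regex approach, assuming Python 3 syntax (in Python 2 the pattern and input would be u'' literals). The function name straighten_apostrophes is hypothetical, and including U+2018 alongside U+2019 is an assumption about what else might turn up:

```python
import re

# Character class matching the curly single quotes:
# U+2018 (left) and U+2019 (right).
CURLY_APOSTROPHES = re.compile('[\u2018\u2019]')

def straighten_apostrophes(text):
    """Replace curly single quotes with a plain ASCII apostrophe."""
    return CURLY_APOSTROPHES.sub("'", text)

print(straighten_apostrophes('Frame\u2019s output'))
```

The re.UNICODE flag only changes how classes such as \w and \s match, so a literal character class like this one should not need it. For a single character, a plain text.replace('\u2019', "'") would likely do the same job.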

But then I thought:
To prevent exceptions by making sure every Unicode character is either mapped to ASCII
or removed, it seems like what I really want is a Unicode version of string.maketrans() and string.translate(), which is deprecated.
Can anyone tell me what the equivalent is for Unicode strings?
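
One possible shape for this, as a sketch only: the translate() method of Unicode strings accepts a dictionary keyed by code point (ordinal), whose values can be replacement strings or None to delete. The table TO_ASCII and the function to_ascii below are hypothetical names, the choice of characters in the table is an assumption, and the code uses Python 3 syntax:

```python
# Hypothetical mapping from Unicode code points to ASCII replacements.
TO_ASCII = {
    0x2018: "'",   # LEFT SINGLE QUOTATION MARK
    0x2019: "'",   # RIGHT SINGLE QUOTATION MARK
    0x201C: '"',   # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',   # RIGHT DOUBLE QUOTATION MARK
    0x2013: '-',   # EN DASH
}

def to_ascii(text):
    """Map known characters to ASCII, then drop anything else non-ASCII."""
    mapped = text.translate(TO_ASCII)
    # 'ignore' silently removes characters that still have no ASCII form.
    return mapped.encode('ascii', 'ignore').decode('ascii')

print(to_ascii('Frame\u2019s \u201cfunky\u201d quotes'))
```

Dropping unmapped characters with 'ignore' loses information silently, so using 'replace' instead, or logging what was removed, may be the safer choice depending on the data.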

Thanks,
Stephen
