[Baypiggies] Handling unwanted Unicode \u2019 characters in XML
Max Slimmer
max at theslimmers.net
Wed Jul 2 03:21:13 CEST 2008
Stephen McInerney wrote:
> Here's one for the XML people,
>
> I am using XML imported from FrameMaker, which contains the unwanted
> Unicode character '\u2019' (the character started out as a plain
> apostrophe in the source Frame document.)
> It seems this is a common issue with many word-processors (MS, Frame etc.)
> using the funky right- and left- leaning apostrophes. I see many
> references to this issue on the web.
>
> You can't always print Unicode strings as is; on an ASCII-only console
> it raises an exception, so you must encode them first.
> But the UTF-8 encoding of \u2019 is not very human-readable or useful:
> >>> u'\u2019'.encode('utf-8')
> '\xe2\x80\x99'
>
> Hence I thought I should do a find and replace with a regex to map the
> unwanted \u2019 back to plain old apostrophe.
> (You can do Unicode regexes with re.compile(<pattern>, re.UNICODE))
>
> But then I thought:
> In the interest of preventing exceptions by making sure all Unicode
> characters are either mapped to ASCII
> or removed, it seems like I really want a Unicode version of
> string.maketrans() and string.translate(), which is deprecated.
> Can anyone tell me what that equivalent is, for Unicode fns?
>
> Thanks,
> Stephen
I process XML data, and it often contains these interesting Unicode
characters along with a few other non-ASCII characters. The interesting
thing is that there is a small subset of non-ASCII characters, the
cp1252 (Windows-1252) bytes in the range 0x80-0x9F, that don't map one
to one, i.e. the byte's ordinal value is not the same as the Unicode
ordinal value. I include a list of these below with their names, and
you can make up a small dict to map them to whatever you would like.
If you want to convert these from unicode to str or vice versa, look at
the following:
>>> u = u'a\u2019b'
>>> a = 'a\x92b'
>>> u
u'a\u2019b'
>>> a
'a\x92b'
>>> u.encode('cp1252')
'a\x92b'
>>> a.decode('cp1252')
u'a\u2019b'
>>>
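For readers on Python 3, where bytes and str are separate types, the same round trip might look like the sketch below. The variable names are only illustrative. It also shows why the byte 0x92 is "interesting": decoded as Latin-1 it keeps its ordinal value and becomes a control character, while cp1252 maps it to U+2019.

```python
# Python 3 sketch of the cp1252 round trip shown in the session above.
raw = b'a\x92b'                   # cp1252-encoded bytes

text = raw.decode('cp1252')       # 0x92 becomes U+2019, the curly apostrophe
misread = raw.decode('latin-1')   # 0x92 stays U+0092, a C1 control character

round_trip = text.encode('cp1252')

print(ascii(text))        # 'a\u2019b'
print(ascii(misread))     # 'a\x92b'
print(round_trip)         # b'a\x92b'
```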
The coolest way to convert these (I think) is the following:
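import re

# Matches any character outside the range \x01-\x7f.
nonAscii = re.compile('[^\x01-\x7f]')

def escapeCP1252(s):
    return nonAscii.sub(_esccp1252, s)

def _esccp1252(m):
    return "&#%d;" % ord(_cp1252(m))

def _cp1252(m):
    # Look up cp1252 bytes 0x80-0x9F in the cp1252 dict listed below.
    c = unichr(ord(m.group(0)))
    return cp1252.get(c, c)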
The above simply finds all non-ASCII characters and replaces each one
with an HTML &#nn; escape string. If you want a plain quote character
instead of u'\u2019', have the routine return the result of the
cp1252.get(...) lookup for m.group(0) directly, after modifying the
dict cp1252 to your taste.
cp1252 = {
# from http://www.microsoft.com/typography/unicode/1252.htm
u"\x80": u"\u20AC", # EURO SIGN
u"\x82": u"\u201A", # SINGLE LOW-9 QUOTATION MARK
u"\x83": u"\u0192", # LATIN SMALL LETTER F WITH HOOK
u"\x84": u"\u201E", # DOUBLE LOW-9 QUOTATION MARK
u"\x85": u"\u2026", # HORIZONTAL ELLIPSIS
u"\x86": u"\u2020", # DAGGER
u"\x87": u"\u2021", # DOUBLE DAGGER
u"\x88": u"\u02C6", # MODIFIER LETTER CIRCUMFLEX ACCENT
u"\x89": u"\u2030", # PER MILLE SIGN
u"\x8A": u"\u0160", # LATIN CAPITAL LETTER S WITH CARON
u"\x8B": u"\u2039", # SINGLE LEFT-POINTING ANGLE QUOTATION MARK
u"\x8C": u"\u0152", # LATIN CAPITAL LIGATURE OE
u"\x8E": u"\u017D", # LATIN CAPITAL LETTER Z WITH CARON
u"\x91": u"\u2018", # LEFT SINGLE QUOTATION MARK
u"\x92": u"\u2019", # RIGHT SINGLE QUOTATION MARK
u"\x93": u"\u201C", # LEFT DOUBLE QUOTATION MARK
u"\x94": u"\u201D", # RIGHT DOUBLE QUOTATION MARK
u"\x95": u"\u2022", # BULLET
u"\x96": u"\u2013", # EN DASH
u"\x97": u"\u2014", # EM DASH
u"\x98": u"\u02DC", # SMALL TILDE
u"\x99": u"\u2122", # TRADE MARK SIGN
u"\x9A": u"\u0161", # LATIN SMALL LETTER S WITH CARON
u"\x9B": u"\u203A", # SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
u"\x9C": u"\u0153", # LATIN SMALL LIGATURE OE
u"\x9E": u"\u017E", # LATIN SMALL LETTER Z WITH CARON
u"\x9F": u"\u0178", # LATIN CAPITAL LETTER Y WITH DIAERESIS
}
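On the original question about a Unicode version of maketrans()/translate(): the translate() method of Unicode strings accepts a dict keyed by code point, so no separate maketrans step is needed. Below is a minimal Python 3 sketch; the table TO_ASCII, the function asciify and its drop_rest flag are illustrative names, and the replacements chosen are just one possible taste.

```python
# Keys are Unicode code points; values are the replacement strings.
TO_ASCII = {
    0x2018: "'",    # LEFT SINGLE QUOTATION MARK
    0x2019: "'",    # RIGHT SINGLE QUOTATION MARK
    0x201C: '"',    # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',    # RIGHT DOUBLE QUOTATION MARK
    0x2013: '-',    # EN DASH
    0x2014: '--',   # EM DASH
    0x2026: '...',  # HORIZONTAL ELLIPSIS
}


def asciify(text, drop_rest=False):
    """Map typographic punctuation to ASCII; optionally drop what is left."""
    result = text.translate(TO_ASCII)
    if drop_rest:
        # Discard any remaining non-ASCII characters instead of raising.
        result = result.encode('ascii', 'ignore').decode('ascii')
    return result


print(asciify('it\u2019s \u201cfine\u201d'))  # it's "fine"
```

Characters that are not in the table pass through unchanged unless drop_rest is set, so the mapping can be extended one entry at a time.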