[Baypiggies] Handling unwanted Unicode \u2019 characters in XML

Wed Jul 2 03:21:13 CEST 2008

Stephen McInerney wrote:
> Here's one for the XML people,
>
> I am using XML imported from FrameMaker, which contains the unwanted 
> Unicode
> character '\u2019' (the character started out as a plain apostrophe in 
> the source Frame document.)
> It seems this is a common issue with many word-processors (MS, Frame etc.)
> using the funky right- and left- leaning apostrophes. I see many 
> references to this issue on the web.
>
> You can't print Unicode strings as is, it causes an exception, you 
> must encode them (to ASCII).
> But the ASCII encoding of \u2019 is not very human-readable or useful:
> >>> u'\u2019'.encode('utf-8')
> '\xe2\x80\x99'
>
> Hence I thought I should do a find or replace with a regex to map the 
> unwanted \u2019 back to plain old apostrophe.
> (You can do Unicode regexes with re.compile(<pattern>, re.UNICODE))
>
> But then I thought:
> In the interest of preventing exceptions by making sure all Unicode 
> characters are either mapped to ASCII
> or removed, it seems like I really want a Unicode version of 
> string.maketrans() and string.translate(), which is deprecated.
> Can anyone tell me what that equivalent is, for Unicode fns?
>
> Thanks,
> Stephen
>
I process XML data, and it often contains these interesting Unicode
characters along with a few other non-ASCII characters. The interesting
thing is that there is a small subset of non-ASCII characters (the
cp1252 bytes in the range 0x80-0x9F) that don't map one to one, i.e. the
byte's ordinal value is not the same as its Unicode ordinal value. I
will include a list of these with their names, and you can make up a
small dict to map them to whatever you would like.
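
As for a Unicode counterpart to string.translate(): the translate()
method on Unicode strings accepts a dict keyed by code point. Here is a
minimal sketch, assuming Python 3 (in Python 2 the same dict works with
u'...'.translate()). The names ASCII_FALLBACKS and to_ascii are made up
for illustration, and the replacements are only one possible choice.

```python
# Hypothetical table: keys are Unicode code points, values are the
# ASCII text to substitute. Adjust to taste.
ASCII_FALLBACKS = {
    0x2018: "'",    # LEFT SINGLE QUOTATION MARK
    0x2019: "'",    # RIGHT SINGLE QUOTATION MARK
    0x201C: '"',    # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',    # RIGHT DOUBLE QUOTATION MARK
    0x2013: "-",    # EN DASH
    0x2014: "--",   # EM DASH
    0x2026: "...",  # HORIZONTAL ELLIPSIS
}


def to_ascii(text):
    """Substitute known typographic characters, then drop any other non-ASCII."""
    replaced = text.translate(ASCII_FALLBACKS)
    # errors="ignore" silently removes whatever is still outside ASCII.
    return replaced.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Stephen\u2019s \u201cquoted\u201d text \u2014 caf\u00e9"))
# prints: Stephen's "quoted" text -- caf
```

Characters missing from the table are dropped rather than raising an
exception, which may or may not be what you want.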

But if you want to convert these from Unicode to a byte string or vice
versa, look at the following:
 >>> u = u'a\u2019b'
 >>> a = 'a\x92b'
 >>> u
u'a\u2019b'
 >>> a
'a\x92b'
 >>> u.encode('cp1252')
'a\x92b'
 >>> a.decode('cp1252')
u'a\u2019b'
 >>>
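
For anyone on Python 3, the same round trip might look like the sketch
below. It assumes Python 3, where text is str and encoded data is bytes;
the variable names are only illustrative.

```python
text = "a\u2019b"   # Unicode text containing RIGHT SINGLE QUOTATION MARK
raw = b"a\x92b"     # the same text as cp1252-encoded bytes

# Encoding and decoding with cp1252 are inverses for this character.
assert text.encode("cp1252") == raw
assert raw.decode("cp1252") == text

# ASCII has no representation for U+2019, so a strict encode fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(ascii_ok)
```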
the coolest way to convert these (I think) is the following:
import re

nonAscii = re.compile('[^\x01-\x7f]')

def escapeCP1252(s):
    return nonAscii.sub(_esccp1252,s)
def _esccp1252(m):
    return "&#%d;" % ord(_cp1252(m))
def _cp1252(m):
    c = unichr(ord(m.group(0)))
    return cp1252.get(c, c)

The above simply finds all non-ASCII characters and returns an HTML
&#nn; escape string for each. If you want to return a quote character
instead of u'\u2019', then have the routine return _cp1252(m) directly
(that is, cp1252.get(c, c)) after modifying the dict cp1252 to your
taste.
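
A rough Python 3 equivalent of the escape routine is sketched here. Note
that it swaps the hand-written dict for Python's built-in cp1252 codec,
and that the names NON_ASCII, _fix_c1 and escape_cp1252 are my own, so
treat it as an illustration rather than a drop-in replacement.

```python
import re

# Matches any single character outside the ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7f]")


def _fix_c1(ch):
    """Map a stray U+0080..U+009F character through cp1252, if possible."""
    if "\x80" <= ch <= "\x9f":
        try:
            return bytes([ord(ch)]).decode("cp1252")
        except UnicodeDecodeError:
            # Byte is undefined in cp1252 (e.g. 0x81); leave it unchanged.
            return ch
    return ch


def escape_cp1252(text):
    """Replace each non-ASCII character with an HTML &#nn; reference."""
    return NON_ASCII.sub(lambda m: "&#%d;" % ord(_fix_c1(m.group(0))), text)


print(escape_cp1252("a\u2019b"))  # prints: a&#8217;b
print(escape_cp1252("a\x92b"))    # prints: a&#8217;b
```

If all you need is the &#nn; form for text that is already correct
Unicode, text.encode("ascii", "xmlcharrefreplace") gives the same result
without a regex.
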
cp1252 = {
    # from http://www.microsoft.com/typography/unicode/1252.htm
    u"\x80": u"\u20AC", # EURO SIGN
    u"\x82": u"\u201A", # SINGLE LOW-9 QUOTATION MARK
    u"\x83": u"\u0192", # LATIN SMALL LETTER F WITH HOOK
    u"\x84": u"\u201E", # DOUBLE LOW-9 QUOTATION MARK
    u"\x85": u"\u2026", # HORIZONTAL ELLIPSIS
    u"\x86": u"\u2020", # DAGGER
    u"\x87": u"\u2021", # DOUBLE DAGGER
    u"\x88": u"\u02C6", # MODIFIER LETTER CIRCUMFLEX ACCENT
    u"\x89": u"\u2030", # PER MILLE SIGN
    u"\x8A": u"\u0160", # LATIN CAPITAL LETTER S WITH CARON
    u"\x8B": u"\u2039", # SINGLE LEFT-POINTING ANGLE QUOTATION MARK
    u"\x8C": u"\u0152", # LATIN CAPITAL LIGATURE OE
    u"\x8E": u"\u017D", # LATIN CAPITAL LETTER Z WITH CARON
    u"\x91": u"\u2018", # LEFT SINGLE QUOTATION MARK
    u"\x92": u"\u2019", # RIGHT SINGLE QUOTATION MARK
    u"\x93": u"\u201C", # LEFT DOUBLE QUOTATION MARK
    u"\x94": u"\u201D", # RIGHT DOUBLE QUOTATION MARK
    u"\x95": u"\u2022", # BULLET
    u"\x96": u"\u2013", # EN DASH
    u"\x97": u"\u2014", # EM DASH
    u"\x98": u"\u02DC", # SMALL TILDE
    u"\x99": u"\u2122", # TRADE MARK SIGN
    u"\x9A": u"\u0161", # LATIN SMALL LETTER S WITH CARON
    u"\x9B": u"\u203A", # SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
    u"\x9C": u"\u0153", # LATIN SMALL LIGATURE OE
    u"\x9E": u"\u017E", # LATIN SMALL LETTER Z WITH CARON
    u"\x9F": u"\u0178", # LATIN CAPITAL LETTER Y WITH DIAERESIS