Finding a \u0096

Wed Dec 4 10:01:44 EST 2002

u'\u0096' # unicode string containing a single character
'\u0096' # plain string with 6 characters: backslash, u, 0, 0, 9, 6
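
For anyone reading this on Python 3, where plain string literals are already Unicode, the same distinction shows up between a normal literal and a raw one: the raw literal leaves the escape unprocessed, much as the plain string above does. A minimal sketch (the variable names are only for illustration):

```python
# Python 3: a normal literal processes the \u escape, a raw literal does not.
single = '\u0096'     # one code point, U+0096
literal = r'\u0096'   # six characters: backslash, u, 0, 0, 9, 6

print(len(single), len(literal))
print(single.encode('utf-8'))  # the two UTF-8 bytes that encode U+0096
```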

What you're trying to do (if I understand correctly) is to scan for 
UTF-8-encoded Unicode code points inside regular (byte) strings.  I'm 
not sure that's particularly safe (why not just convert to unicode, do 
the replace, then re-encode as UTF-8?), but here's what you'd do if you 
really want to do this:

    import string
    # search for the UTF-8 bytes of U+0096 rather than the literal '\u0096'
    text = string.replace(text, u'\u0096'.encode('utf-8'), '&#x2013;')
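
The same byte-level replacement can be sketched for Python 3, where the encoded text is a bytes object and both arguments to replace must be bytes as well. The helper name replace_dash_bytes and the sample data are hypothetical, chosen only for this illustration:

```python
# Python 3 sketch: replace the UTF-8 encoding of U+0096 inside a bytes object.
C1_DASH = '\u0096'.encode('utf-8')  # the two bytes b'\xc2\x96'

def replace_dash_bytes(data: bytes) -> bytes:
    """Swap every UTF-8-encoded U+0096 in data for the en dash reference."""
    return data.replace(C1_DASH, b'&#x2013;')

# Hypothetical sample: "pages 10", U+0096, "20", encoded as UTF-8.
sample = 'pages 10\u009620'.encode('utf-8')
print(replace_dash_bytes(sample))
```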

I'd suggest doing:

    text = text.decode('utf-8')                 # byte string -> unicode
    text = text.replace(u'\u0096', '&#x2013;')  # replace on code points
    text = text.encode('utf-8')                 # unicode -> UTF-8 byte string

instead.
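
Since the original question mentions several characters in the C1 range, the decode, replace, re-encode approach might be generalised roughly as below. This is only a sketch for Python 3, and it rests on an assumption not confirmed in the thread: that each C1 code point in the text stands for the Windows-1252 character with the same byte value. The function name c1_to_entities is hypothetical.

```python
def c1_to_entities(data: bytes) -> bytes:
    """Decode UTF-8, replace C1 code points (U+0080..U+009F) with numeric
    character references for their Windows-1252 meaning, then re-encode.

    Assumption: the C1 code points came from cp1252 bytes read as Latin-1.
    """
    text = data.decode('utf-8')
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x80 <= cp <= 0x9F:
            try:
                # Reinterpret the code point as a cp1252 byte (0x96 -> U+2013).
                real = bytes([cp]).decode('cp1252')
            except UnicodeDecodeError:
                # A few byte values are undefined in cp1252; keep them as-is.
                out.append(ch)
                continue
            out.append('&#x%04X;' % ord(real))
        else:
            out.append(ch)
    return ''.join(out).encode('utf-8')

print(c1_to_entities('10\u009620 caf\u00e9'.encode('utf-8')))
```

Characters outside the C1 range, including legitimate non-ASCII ones, pass through unchanged and are simply re-encoded as UTF-8.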

HTH,
Mike

Gustaf Liljegren wrote:

>I'm using Python to automate some mechanisms in a Word to XML
>conversion. The XML file should be encoded in UTF-8. Since Word is
>using Microsoft's "ANSI" character set and I want Unicode in UTF-8,
>some characters need to be replaced. All these characters reside in
>the C1 interval in Unicode (i.e. between DEL and NBSP in Latin 1).
>
>When I try to replace these characters,
>
>  text = string.replace(text, '\u0096', '&#x2013;')  # En dash
>
>Python doesn't recognize them. I have to write it in Greek to get
>Python to understand what I mean:
>
>  text = string.replace(text, 'â€“', '&#x2013;')  # En dash
>
>The first quote is: 'a' with circumflex, 'euro' and a right-slanted
>double quote. It works, but it's ugly. Isn't there a better way to
>write this?
>
>Gustaf

-- 
_______________________________________
  Mike C. Fletcher
  Designer, VR Plumber, Coder
  http://members.rogers.com/mcfletch/