Finding a \u0096
Mike C. Fletcher
mcfletch at rogers.com
Wed Dec 4 10:01:44 EST 2002
u'\u0096' # unicode single-character string
'\u0096' # plain string with 6 characters (backslash, u, 0, 0, 9, 6)
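For anyone checking this on a newer interpreter, here is a small sketch of the same distinction. It assumes Python 3 syntax (the post itself is about Python 2), where every plain literal interprets \u, so a raw string stands in for the Python 2 plain string:

```python
# Python 3 syntax (an assumption; the post above targets Python 2).
single = '\u0096'    # escape is interpreted: one code point, U+0096
literal = r'\u0096'  # raw string: backslash, 'u', '0', '0', '9', '6'

print(len(single))             # 1
print(len(literal))            # 6
print(single.encode('utf-8'))  # b'\xc2\x96'
```

A search for the six-character form never matches the single code point, which is why the replace in the original question finds nothing.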
What you're trying to do (if I understand correctly) is to scan for
utf-8-encoded unicode code-points in regular strings. I'm not sure
that's particularly safe (why not just convert to unicode, do the
replace, then re-encode in utf-8?), but here's what you'd do if you
really want to do this...
import string
text = string.replace(text, u'\u0096'.encode('utf-8'), u'\u2013'.encode('utf-8'))
I'd suggest doing:
text = text.decode('utf-8')
text = text.replace(u'\u0096', u'\u2013') # En dash
text = text.encode('utf-8')
instead.
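As a rough sketch of that decode/replace/encode round trip wrapped in a function: this assumes Python 3 syntax (bytes and str rather than Python 2's str and unicode), and the name replace_c1_dash and the sample input are made up for illustration. The last lines cover a separate assumption about the pipeline, namely that the input may still be in Word's Windows-1252 bytes rather than UTF-8.

```python
def replace_c1_dash(data: bytes) -> bytes:
    """Replace U+0096 with an en dash (U+2013) in UTF-8 encoded bytes.

    Hypothetical helper, not part of any library.
    """
    text = data.decode('utf-8')              # bytes -> str
    text = text.replace('\u0096', '\u2013')  # operate on code points, not bytes
    return text.encode('utf-8')              # str -> bytes


# Made-up sample: "pages 10", then U+0096, then "20", encoded as UTF-8.
sample = 'pages 10\u009620'.encode('utf-8')
result = replace_c1_dash(sample)
print(result)  # b'pages 10\xe2\x80\x9320'

# If the bytes are still Windows-1252 (an assumption about the input),
# decoding with that codec maps byte 0x96 straight to the en dash.
from_word = b'pages 10\x9620'.decode('cp1252')
print(from_word == 'pages 10\u201320')  # True
```

Doing the replacement on decoded text avoids matching a byte sequence that only happens to look like the target in the middle of some other multi-byte character.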
HTH,
Mike
Gustaf Liljegren wrote:
>I'm using Python to automate some mechanisms in a Word to XML
>conversion. The XML file should be encoded in UTF-8. Since Word is
>using Microsoft's "ANSI" character set and I want Unicode in UTF-8,
>some characters need to be replaced. All these characters reside in
>the C1 interval in Unicode (i.e. between DEL and NBSP in Latin 1).
>
>When I try to replace these characters,
>
> text = string.replace(text, '\u0096', '–') # En dash
>
>Python doesn't recognize them. I have to write it in Greek to get
>Python to understand what I mean:
>
> text = string.replace(text, 'â€“', '–') # En dash
>
>The first quote is: 'a' with circumflex, 'euro' and a right-slanted
>double quote. It works, but it's ugly. Isn't there a better way to
>write this?
>
>Gustaf
>
>
>
--
_______________________________________
Mike C. Fletcher
Designer, VR Plumber, Coder
http://members.rogers.com/mcfletch/