regular expressions, unicode and XML

ProvoWallis gshepherd281281 at yahoo.com
Thu Jan 26 02:07:16 EST 2006


Hi,

I'm hoping someone can help me. I'm hopelessly lost.

I'm trying to make a change in some XML files using a regular
expression (re.sub). I can capture the text I want to replace OK, but
when I replace it I end up with nothing: i.e., just a "" character in my
file.

data = re.sub(
    r'(?i)(?u)<title><emph typestyle=\"bf\">Sample Title</emph></title>'
    r'<para indent=\"none\" runin=\"1\"><emph typestyle=\"bf\">\&#x2014;(.*?):</emph>',
    '<title><icon name="graphic"/>&#x2002;<emph typestyle="bf">Sample '
    'Title&#x2014;\1:</emph></title><para indent="none" runin="1">',
    data)
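
One possible explanation, offered as a guess rather than a confirmed diagnosis: the replacement string in the call above is an ordinary (non-raw) string literal, so Python reads `\1` as an octal escape for the control character U+0001 before re.sub ever sees it. That would leave a stray unprintable character where the captured text was expected. A minimal sketch, using a shortened, made-up pattern and input rather than the real file:

```python
import re

# Hypothetical, shortened input; not the poster's real data.
sample = '<emph typestyle="bf">&#x2014;Overview:</emph>'
pattern = r'<emph typestyle="bf">&#x2014;(.*?):</emph>'

# Non-raw replacement: '\1' is the single character chr(1), not a backreference.
broken = re.sub(pattern, '<emph>Sample Title&#x2014;\1:</emph>', sample)

# Raw replacement: re.sub receives a backslash followed by '1' and
# substitutes the text captured by group 1.
fixed = re.sub(pattern, r'<emph>Sample Title&#x2014;\1:</emph>', sample)

print(repr(broken))
print(repr(fixed))
```

If this is the cause, prefixing the replacement with `r` (or writing `\\1`) should be enough, independent of any encoding question.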

I think my problem is that I don't understand Unicode or even know how
my XML is encoded, because there is nothing in the XML declaration at the
top of the file.
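
For the "how is my XML encoded" part, a rough sketch of one way to guess: look for a byte order mark, then for an `encoding="..."` attribute in the XML declaration, and otherwise fall back to UTF-8, which is what the XML specification prescribes when neither is present. The function name `guess_xml_encoding` is made up for illustration, and the sketch only handles UTF-8 and UTF-16 byte order marks:

```python
import codecs
import re

def guess_xml_encoding(raw):
    """Guess the encoding of XML bytes from a BOM or the XML declaration."""
    # A byte order mark, if present, settles the question.
    if raw.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'utf-16'
    # Otherwise look for encoding="..." in a leading <?xml ...?> declaration.
    match = re.match(br'<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']', raw)
    if match:
        return match.group(1).decode('ascii')
    # No BOM and no declared encoding: the XML spec says to assume UTF-8.
    return 'utf-8'

print(guess_xml_encoding(b'<?xml version="1.0"?><title>x</title>'))
```

This only reports what the file claims about itself; a file that is mislabeled would still decode incorrectly.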

I'd be grateful if someone could give a little advice or point me in the
right direction. I've read a bunch of stuff on the board but nothing
seems to click. I'm guessing I have to decode my file when I read it,
something like this:

raw = inputFile.read()
fileencoding = "utf-8"
data = raw.decode(fileencoding)

and then write it out similarly, but this doesn't seem to work.
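
The decode-on-read, encode-on-write round trip described here could be sketched as follows. The helper name `rewrite_bytes` and the sample content are assumptions for illustration; the idea is to decode once, do all regex work on text, and encode back with the same encoding:

```python
import re

def rewrite_bytes(raw, pattern, replacement, encoding='utf-8'):
    """Decode bytes, apply re.sub on the text, and encode the result."""
    text = raw.decode(encoding)                # bytes -> text
    text = re.sub(pattern, replacement, text)  # regex runs on text only
    return text.encode(encoding)               # text -> bytes, same encoding

# Hypothetical content containing a literal em dash (U+2014).
raw = u'<title>Sample Title</title><para>\u2014Overview:</para>'.encode('utf-8')
out = rewrite_bytes(raw, u'\u2014(.*?):', r'[\1]')
print(out.decode('utf-8'))
```

With real files, `raw` would come from a file opened in binary mode (`'rb'`) and the result would be written to a file opened with `'wb'`. Note that a character reference such as `&#x2014;` is plain ASCII in the file, so it survives this round trip unchanged whichever encoding is used.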

Any help appreciated,

Greg



