[XML-SIG] lxml - html entities

spencer.c spencer.crissman at gmail.com
Mon Jul 28 18:13:33 CEST 2008


I am using lxml to process some XHTML files.  The files have HTML character
references embedded in them, for instance &#39; rather than a literal '.  When I
parse the files, edit them, and then write them back out, I want my edits to be
the only changes in the output files, but lxml is also replacing the character
references with the actual characters they represent.

So if I have:
It& #39;s an example. <-- Space inserted to help readability.

It is writing out:
It's an example.  

I've tried setting resolve_entities to False, like so:
tree = etree.parse(input, etree.XMLParser(resolve_entities=False))

But this seems to have no effect.
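
Here is a minimal script that reproduces what I'm seeing, assuming lxml is installed. The sample markup and the variable names (source, parser, output) are made up for illustration; my real input is a full XHTML file. As far as I can tell, resolve_entities only affects named entity references, not numeric character references like the one here, which may be why it makes no difference.

```python
from io import BytesIO

from lxml import etree

# Hypothetical one-element document standing in for a real XHTML file.
source = b"<p>It&#39;s an example.</p>"

# Same parser setting as in my attempt above.
parser = etree.XMLParser(resolve_entities=False)
tree = etree.parse(BytesIO(source), parser)

# The numeric character reference is already a plain apostrophe
# in the parsed tree, before any serialization happens.
root = tree.getroot()
text = root.text
print(text)

# Serializing writes the literal character, not the original reference.
output = etree.tostring(tree)
print(output)
```

Both the parsed text and the serialized bytes contain a literal apostrophe, so the reference is lost at parse time rather than at write time.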

Is there a way to tell lxml to ignore these and leave them as they are?

Thanks.

-s
-- 
View this message in context: http://www.nabble.com/lxml---html-entities-tp18693223p18693223.html
Sent from the Python - xml-sig mailing list archive at Nabble.com.


