[XML-SIG] lxml - html entities

Tue Jul 29 07:43:28 CEST 2008

(this is being discussed on the lxml mailing list)

spencer.c wrote:
> I am using lxml to process some xhtml files.  The files have html character
> codes embedded in them.  For instance: &#39; rather than a '.  When I parse
> the files, edit them, and then write them back out, I want my edits to be
> the only changes in the output files, but lxml is replacing the character
> codes with the actual characters they are supposed to represent as well.
> 
> So if I have:
> It& #39;s an example. <-- Space inserted to help readability.
> 
> It is writing out:
> It's an example.  
> 
> I've tried setting resolve_entities to False, a la:
> tree = etree.parse(input, etree.XMLParser(resolve_entities=False))
> 
> But this seems to have no effect.
> 
> Is there a way to tell lxml to ignore these/leave them as is?
> 
> Thanks.
> 
> -s
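
For what it's worth, here is a small sketch of the behaviour described above. A numeric character reference such as &#39; is not an entity in the XML sense: the parser expands it while reading the document, so the tree only ever holds the plain character. As far as I know, resolve_entities only concerns named entity references, which would explain why it has no effect here. The second half shows one possible workaround done outside the parser; it is my own assumption and not an lxml feature. The names source, protected and restored are made up for illustration, and the import falls back to the standard library's ElementTree (which behaves the same for this case) if lxml is not installed.

```python
# Use lxml when available; the stdlib ElementTree behaves the same here.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

source = "<p>It&#39;s an example.</p>"

# 1) Plain round trip: the reference is expanded while parsing,
#    so the tree holds only the apostrophe itself.
root = etree.fromstring(source)
serialized = etree.tostring(root, encoding="unicode")
print(root.text)     # the reference is gone, only ' remains
print(serialized)

# 2) Workaround sketch (an assumption, not an lxml feature):
#    escape the ampersand of every numeric reference before parsing,
#    so it survives as literal text, then undo that after serializing.
#    Caveat: input that already contains a literal "&amp;#" would be
#    altered by the restore step.
protected = source.replace("&#", "&amp;#")
root2 = etree.fromstring(protected)
# ... edits to root2 would go here; its text contains "&#39;" literally ...
restored = etree.tostring(root2, encoding="unicode").replace("&amp;#", "&#")
print(restored)
```

With the second approach the text seen while editing contains the raw reference ("&#39;") rather than the apostrophe, so any code that searches or modifies the text has to take that into account.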