preserving entities with lxml

Robin Becker robin at reportlab.com
Wed Jan 12 05:22:22 EST 2022


I have a puzzle over how lxml & entities should be 'preserved'; the code below illustrates. To preserve them 
I change & --> &amp; in the source and add resolve_entities=False to the parser definition. The escaping means 
we only have one kind of entity, &amp;, which means lxml will preserve it. For whatever reason lxml won't 
preserve character entities, e.g. &#33;.

The simple parse from string and conversion tostring shows that the parsing at least took notice of it.

However, I want to create a tuple tree, so I have to use tree.text, tree.getchildren() and tree.tail for access.
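
For illustration, a minimal sketch of the kind of tuple tree meant here. The (tag, attrs, content) shape and the name to_tuple_tree are assumptions made for this example, not something from lxml. It uses the standard-library ElementTree so that it runs without lxml; the text, tail and attrib attributes are the same ones lxml exposes. lxml can also yield comment and processing-instruction nodes when iterating, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET


def to_tuple_tree(el):
    """Return (tag, attrs, content); content mixes strings and nested tuples.

    Hypothetical shape chosen for this sketch only.
    """
    content = []
    if el.text:
        content.append(el.text)
    for child in el:
        content.append(to_tuple_tree(child))
        # Text following a child element lives on the child's .tail
        if child.tail:
            content.append(child.tail)
    return (el.tag, dict(el.attrib), content)


root = ET.fromstring('<a x="1">head<b>inner</b>tail</a>')
print(to_tuple_tree(root))
```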

When I use those I expected to have to undo the escaping to get back the original entities, but it seems that 
has already been done.

Good for me, but if the tree knows how it was created (tostring shows that) why is it ignored with attribute access?

if __name__=='__main__':
     from lxml import etree as ET
     #initial xml
     xml = b'<a attr="&mysym; &lt; &amp; &gt; &#33;">aaaaa &mysym; &lt; &amp; &gt; &#33; AAAAA</a>'
     #escaped xml
     xxml = xml.replace(b'&',b'&amp;')

     myparser = ET.XMLParser(resolve_entities=False)
     tree = ET.fromstring(xxml,parser=myparser)

     #use tostring
     print(f'using tostring\n{xxml=!r}\n{ET.tostring(tree)=!r}\n')

     #now access the items using text & children & tail
     print(f'using attributes\n{tree.text=!r}\n{tree.getchildren()=!r}\n{tree.tail=!r}')

When run I see this:

$ python tmp/tlp.py
using tostring
xxml=b'<a attr="&amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33;">aaaaa &amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33; AAAAA</a>'
ET.tostring(tree)=b'<a attr="&amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33;">aaaaa &amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33; AAAAA</a>'

using attributes
tree.text='aaaaa &mysym; &lt; &amp; &gt; &#33; AAAAA'
tree.getchildren()=[]
tree.tail=None
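
The same effect can be reproduced in a smaller form. This is only a sketch, using the standard-library ElementTree so that it runs without lxml, and the sample string is made up for the example. Parsing removes exactly one level of escaping, so the text and attribute values come back as the original entity source, while tostring re-escapes on the way out and so reproduces the escaped input.

```python
import xml.etree.ElementTree as ET

# Made-up sample: the original markup, then one extra level of escaping
original = b'<a attr="&lt; &amp;">x &lt; &amp; y</a>'
escaped = original.replace(b'&', b'&amp;')

root = ET.fromstring(escaped)

# The parser undoes one level of escaping when it builds the tree
print(root.text)         # x &lt; &amp; y
print(root.get('attr'))  # &lt; &amp;

# Serialisation escapes '&' again, giving back the escaped input
print(ET.tostring(root))
```

So attribute access is not ignoring how the tree was built: .text holds the parsed character data, and the &amp; seen in the tostring output is added again at serialisation time.
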
-- 
Robin Becker

