preserving entities with lxml

Dieter Maurer dieter at handshake.de
Wed Jan 12 15:49:23 EST 2022


Robin Becker wrote at 2022-1-12 10:22 +0000:
>I have a puzzle over how lxml & entities should be 'preserved' code below illustrates. To preserve I change & --> &amp;
>in the source and add resolve_entities=False to the parser definition. The escaping means we only have one kind of
>entity &amp; which means lxml will preserve it. For whatever reason lxml won't preserve character entities eg &#33;.
>
>The simple parse from string and conversion tostring shows that the parsing at least took notice of it.
>
>However, I want to create a tuple tree so have to use tree.text, tree.getchildren() and tree.tail for access.
>
>When I use those I expected to have to undo the escaping to get back the original entities, but it seems they are
>already done.
>
>Good for me, but if the tree knows how it was created (tostring shows that) why is it ignored with attribute access?
>
>if __name__=='__main__':
>     from lxml import etree as ET
>     #initial xml
>     xml = b'<a attr="&mysym; &lt; &amp; &gt; &#33;">aaaaa &mysym; &lt; &amp; &gt; &#33; AAAAA</a>'
>     #escaped xml
>     xxml = xml.replace(b'&',b'&amp;')
>
>     myparser = ET.XMLParser(resolve_entities=False)
>     tree = ET.fromstring(xxml,parser=myparser)
>
>     #use tostring
>     print(f'using tostring\n{xxml=!r}\n{ET.tostring(tree)=!r}\n')
>
>     #now access the items using text & children & text
>     print(f'using attributes\n{tree.text=!r}\n{tree.getchildren()=!r}\n{tree.tail=!r}')
>
>when run I see this
>
>$ python tmp/tlp.py
>using tostring
>xxml=b'<a attr="&amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33;">aaaaa &amp;mysym; &amp;lt; &amp;amp; &amp;gt;
>&amp;#33; AAAAA</a>'
>ET.tostring(tree)=b'<a attr="&amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33;">aaaaa &amp;mysym; &amp;lt; &amp;amp;
>&amp;gt; &amp;#33; AAAAA</a>'
>
>using attributes
>tree.text='aaaaa &mysym; &lt; &amp; &gt; &#33; AAAAA'
>tree.getchildren()=[]
>tree.tail=None

Apparently, the `resolve_entities=False` was not effective: otherwise,
your tree content should have more structure (especially some
entity reference children).

`&#<value>;` is not an entity reference but a character reference.
It may rightfully be treated differently from entity references.
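
A minimal sketch of that difference, not taken from the original
exchange. It assumes the entity is declared in an internal DTD subset
(the sample document, the replacement text "SYM" and the helper
`describe` are made up for illustration) and that the source is parsed
unescaped. With `resolve_entities=False` the declared entity should
then show up as an entity reference child, while the character
reference and the predefined entities end up as plain text either way.

```python
from lxml import etree

# Hypothetical sample: "mysym" is declared in the internal DTD subset so
# that the reference to it is well-formed.
SAMPLE = (
    b'<!DOCTYPE a [<!ENTITY mysym "SYM">]>'
    b'<a>aaaaa &mysym; &lt; &amp; &gt; &#33; AAAAA</a>'
)


def describe(root):
    """Return the element's text plus (kind, text, tail) for each child."""
    children = []
    for child in root:
        # Entity reference nodes report the etree.Entity factory as their tag.
        kind = "entity" if child.tag is etree.Entity else "element"
        children.append((kind, child.text, child.tail))
    return root.text, children


# Keep references to declared entities as nodes in the tree.
kept = etree.fromstring(SAMPLE, etree.XMLParser(resolve_entities=False))
# Replace references to declared entities by their replacement text.
resolved = etree.fromstring(SAMPLE, etree.XMLParser(resolve_entities=True))

print(describe(kept))
print(describe(resolved))
```

In the first tree the text stops before the reference and the rest of
the content is the tail of the entity child; in the second there are no
children at all. A tuple tree builder walking `text`, children and
`tail` would therefore need a case for entity children if it wants to
keep references such as `&mysym;`.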

