preserving entities with lxml

Robin Becker robin at reportlab.com
Thu Jan 13 04:13:43 EST 2022


On 12/01/2022 20:49, Dieter Maurer wrote:
.......
>>
>> when run I see this
>>
>> $ python tmp/tlp.py
>> using tostring
>> xxml=b'<a attr="&mysym; &lt; &amp; &gt; &#33;">aaaaa &mysym; &lt; &amp; &gt;
>> &#33; AAAAA</a>'
>> ET.tostring(tree)=b'<a attr="&mysym; &lt; &amp; &gt; &#33;">aaaaa &mysym; &lt; &amp;
>> &gt; &#33; AAAAA</a>'
>>
>> using attributes
>> tree.text='aaaaa &mysym; < & > ! AAAAA'
>> tree.getchildren()=[]
>> tree.tail=None
> 
> Apparently, the `resolve_entities=False` was not effective: otherwise,
> your tree content should have more structure (especially some
> entity reference children).
> 
Except that ET.tostring does not expand the entities when it serializes the tree, so in some circumstances
resolve_entities=False does work.

I expected the tree to contain the parsed (unexpanded) values, but the actual tree.text/tail/attrib don't give the
expected results: the predefined entities and the character reference come back expanded. This isn't a criticism; it
makes my life a bit easier. If I had wanted the unexpanded values in attrib/text/tail, it would be more of a problem.
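
A minimal sketch of the text-versus-serialization difference for the predefined entities only. It assumes the standard library's xml.etree.ElementTree as a stand-in for lxml so that it runs without third-party packages, and it does not cover the custom &mysym; entity or resolve_entities at all.

```python
# Sketch only: stdlib ElementTree, not lxml; predefined entities only.
import xml.etree.ElementTree as ET

src = '<a>aaaaa &lt; &amp; &gt; AAAAA</a>'
elem = ET.fromstring(src)

# The parsed text holds the expanded characters ...
text = elem.text
print(repr(text))

# ... while serializing escapes them again.
serialized = ET.tostring(elem, encoding="unicode")
print(serialized)
```

So seeing &lt; in the tostring output and < in tree.text is the normal escape-on-output behaviour rather than evidence that the parser kept the references.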


> `&#<value>` is not an entity reference but a character reference.
> It may rightfully be treated differently from entity references.
I understand the difference, but lxml (and perhaps libxml2) doesn't provide a way to turn off character reference
expansion. This makes using lxml for source transformation a bit harder, since the original text is not preserved.
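
One possible workaround, sketched under loud assumptions: hide the character references behind sentinel characters before parsing and put them back after serializing. The names protect and restore and the sentinel code points are hypothetical, not part of lxml, and the stdlib xml.etree.ElementTree stands in for lxml here so that the sketch runs without extra packages.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sentinels: private-use code points assumed absent from the input.
OPEN, CLOSE = "\ue000", "\ue001"
CHAR_REF = re.compile(r"&#(x[0-9A-Fa-f]+|[0-9]+);")
PROTECTED = re.compile(OPEN + r"(x[0-9A-Fa-f]+|[0-9]+)" + CLOSE)

def protect(xml_text):
    """Replace each character reference with a sentinel-wrapped copy of its number."""
    return CHAR_REF.sub(lambda m: OPEN + m.group(1) + CLOSE, xml_text)

def restore(xml_text):
    """Turn sentinel-wrapped numbers back into character references."""
    return PROTECTED.sub(lambda m: "&#" + m.group(1) + ";", xml_text)

src = '<a attr="x &#33;">aaaaa &lt; &amp; &#33; AAAAA</a>'

# Without protection the reference is expanded and lost on output.
plain = ET.tostring(ET.fromstring(src), encoding="unicode")

# With protection the original reference survives the round trip.
round_tripped = restore(
    ET.tostring(ET.fromstring(protect(src)), encoding="unicode")
)
print(plain)
print(round_tripped)
```

The regular expression is applied to the raw text, so it would also rewrite anything that looks like a character reference inside comments or CDATA sections; whether that matters depends on the documents being transformed.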


