[XML-SIG] problem with elementtree 1.2.6

Chris Withers chris at simplistix.co.uk
Tue Nov 27 11:48:44 CET 2007


Fredrik Lundh wrote:
>> Sorry if this should go to a list, I couldn't find one...
>> (please send me that way if there is one...)
> 
> python-list/comp.lang.python or xml-sig are good choices.

OK, let's go with xml-sig :)

>> I've bumped into an annoying problem, which I actually think is a
>> problem with expat:
>>
>>  >>> from xml.parsers import expat
>>  >>> parser = expat.ParserCreate()
>>  >>> def handle(data): print repr(data)
>> ...
>>  >>> parser.CharacterDataHandler = handle
>>  >>> parser.Parse('<xml>&lt;node/&gt;</xml>',0)
>> u'<'
>> u'node/'
>> u'>'
>> 1
>>
>> Now, why is expat unquoting those two entities?
> 
> in an XML file, the characters < and & *must* be escaped (either as
> entity references or character references) when appearing in normal
> text:

Yes indeed.

> the following entities are predefined: &amp; (&) &lt; (<) &gt; (>)
> &quot; (") &apos; ('). 

Okay, so in the above, if I really mean a literal &lt;, the xml should be:
'<xml>&amp;lt;node/&amp;gt;</xml>'

Seems a little clunky, but okay...
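
A minimal sketch of the difference, written for present-day Python 3 rather than the Python 2 session quoted above. The helper name collect_text is made up for illustration; it just joins whatever expat passes to CharacterDataHandler:

```python
from xml.parsers import expat


def collect_text(xml):
    """Return all character data expat reports for xml, joined together."""
    chunks = []
    parser = expat.ParserCreate()
    # expat may deliver text in several pieces, so collect and join them
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml, True)
    return "".join(chunks)


# Singly escaped: the predefined entities are resolved by the parser
single = collect_text('<xml>&lt;node/&gt;</xml>')

# Doubly escaped: only &amp; is resolved, so the text still reads &lt;...&gt;
double = collect_text('<xml>&amp;lt;node/&amp;gt;</xml>')

print(repr(single))
print(repr(double))
```

So the handler only ever sees one level of escaping removed, which is why the doubled-up form is needed to get a literal &lt; through.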

I guess this was causing me problems as I'm working on a bug in Twiddler 
(http://www.simplistix.co.uk/software/python/twiddler)
where quoted html was ending up unquoted after processing:

 >>> from twiddler import Twiddler
 >>> t = Twiddler('<span>&lt;b&gt;</span>')
 >>> t.render()
u'<span><b></span>'

Now, I see how you fixed this in ElementTree by re-escaping all the 
predefined entities (out of interest, why is the function called 
_escape_cdata rather than _escape_data?) but I can't do that because I 
want users to be able to insert chunks of html and choose whether or not 
they are escaped:

 >>> t = Twiddler('<span id="something"/>')

escaping:

 >>> t['something'].replace('<b>')
 >>> t.render()
u'<span id="something">&lt;b&gt;</span>'

no escaping:

 >>> t['something'].replace('<b>',filters=())
 >>> t.render()
u'<span id="something"><b></span>'

I guess in my use of ElementTree, I need to make sure character data is 
re-escaped at the tree building stage?
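
A rough sketch of what that could look like, again in Python 3. It only assumes the standard library: ElementTree keeps text unescaped in the tree and escapes on output, and xml.sax.saxutils.escape does the same job by hand. The helper render_text and its escaped flag are hypothetical stand-ins for the filters idea above, not Twiddler's actual API:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The parser hands over unescaped text, so the tree stores '<b>' ...
elem = ET.fromstring('<span>&lt;b&gt;</span>')
parsed_text = elem.text

# ... and ElementTree escapes it again when serialising.
serialised = ET.tostring(elem, encoding='unicode')


def render_text(content, escaped=True):
    """Hypothetical helper: escape content unless the caller opts out."""
    # escape() replaces &, < and > with their entity references
    return escape(content) if escaped else content


print(repr(parsed_text))
print(serialised)
print(render_text('<b>'))
print(render_text('<b>', escaped=False))
```

The point of the sketch is that text coming out of the parser would be stored already re-escaped (the escaped=True path), while content inserted by a user could skip that step when they ask for it.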

> other names give an error unless they've been
> explicitly defined.

So I see:

 >>> from xml.parsers import expat
 >>> parser = expat.ParserCreate()
 >>> parser.Parse('<xml>&foo;</xml>',0)
Traceback (most recent call last):
   File "<stdin>", line 1, in ?
xml.parsers.expat.ExpatError: undefined entity: line 1, column 5

But why does calling UseForeignDTD suddenly make everything ok?

 >>> parser = expat.ParserCreate()
 >>> parser.UseForeignDTD()
 >>> parser.Parse('<xml>&foo;</xml>',0)
1

What extra hooks get called as a result of calling UseForeignDTD?
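
One way to poke at this, as a sketch in Python 3. My understanding (an assumption, not something confirmed in this thread) is that with UseForeignDTD set, expat treats the document as if it had an external DTD subset it may not have read, so an undeclared entity is reported through SkippedEntityHandler instead of raising an error. The helper name parse_with_foreign_dtd is made up for illustration:

```python
from xml.parsers import expat


def parse_with_foreign_dtd(xml):
    """Parse xml with UseForeignDTD on; return (skipped entities, text)."""
    skipped = []  # (entity name, is_parameter_entity) pairs
    chunks = []   # ordinary character data
    parser = expat.ParserCreate()
    parser.UseForeignDTD()

    def on_skipped(name, is_parameter_entity):
        # Called for entity references expat could not resolve
        skipped.append((name, is_parameter_entity))

    parser.SkippedEntityHandler = on_skipped
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml, True)
    return skipped, "".join(chunks)


skipped, text = parse_with_foreign_dtd('<xml>a&foo;b</xml>')
print(skipped)
print(repr(text))
```

If that reading is right, the entity is not expanded to anything: it is simply left out of the character data and handed to the skipped-entity hook, so it is up to the application to decide what &foo; should mean.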

cheers,

Chris

-- 
Simplistix - Content Management, Zope & Python Consulting
            - http://www.simplistix.co.uk

