[XML-SIG] unicode problems in elementtree

Bryan Lawrence b.n.lawrence at rl.ac.uk
Fri May 26 22:22:41 CEST 2006


Hi Folks

elementtree is barfing (well, to be correct, expat is barfing) on some 
unicode strings I'm passing through to it ... 

eg:
self = <ElementTree.XMLTreeBuilder instance>, self._parser = 
<pyexpat.xmlparser object>, self._parser.Parse = <built-in method Parse of 
pyexpat.xmlparser object>, data = 
u'<DIF><Entry_ID>badc.nerc.ac.uk:DIF:NM_HiGEM_yaao...on_Date>2005-02-03</Last_DIF_Revision_Date></DIF>'
  ExpatError: not well-formed (invalid token): line 1, column 11389 
      args = ('not well-formed (invalid token): line 1, column 11389',) 
      code = 4 
      lineno = 1 
      offset = 11389

For the record, we find [3 <= tau ] in that block ... we also have problems 
with degree symbols and the like.
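
To see exactly which characters expat is rejecting, the reported line and 
column can be used to slice the input. A minimal sketch, assuming a current 
Python where ElementTree lives at xml.etree.ElementTree and parse errors carry 
a position attribute; the helper name and the sample string are made up for 
illustration. An unescaped '<' is only one way to provoke the same "invalid 
token" message and is not necessarily what is happening in my data.

```python
import xml.etree.ElementTree as ET


def show_parse_error_context(data, width=20):
    """Return the text around a parse error, or None if data parses cleanly.

    Hypothetical helper: it only reports where the parser gave up.
    """
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        # err.position is a (line, column) tuple; the line number is 1-based.
        lineno, column = err.position
        lines = data.splitlines() or [""]
        line = lines[min(lineno, len(lines)) - 1]
        start = max(0, column - width)
        return repr(line[start:column + width])
    return None


# Made-up fragment containing a raw '<' inside character data.
print(show_parse_error_context("<a>3 <= tau</a>"))
```

Using repr() on the slice makes any non-ASCII characters near the failure 
visible as escapes, which helps when the terminal cannot display them.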

I suspect the problem is that I'm not actually passing an xml document (with a 
character encoding definition) to ET ... I'm just passing some stuff which is 
an xml fragment (from a web service interface to a database).

Do elementtree and/or expat need to know the encoding to get this right? 
(That may be a problem, because the fragment could come from anyone's 
document in any encoding ...)
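
One thing I could try is to make the encoding explicit myself before handing 
the fragment over. A minimal sketch, assuming the fragment arrives as an 
already-decoded unicode string with no XML declaration of its own, and written 
for a current Python (xml.etree.ElementTree) rather than the standalone 
package; the function name and sample data are hypothetical. I don't know yet 
whether this addresses the error above.

```python
import xml.etree.ElementTree as ET


def parse_fragment(fragment):
    """Parse a text fragment after encoding it as UTF-8 with a declaration.

    Assumes `fragment` is decoded text without its own <?xml ...?> line;
    prepending a second declaration would itself be a well-formedness error.
    """
    declaration = b'<?xml version="1.0" encoding="utf-8"?>'
    return ET.fromstring(declaration + fragment.encode("utf-8"))


# Made-up sample with a less-than-or-equal sign and a degree symbol.
sample = "<DIF><Summary>3 \u2264 tau, 45\u00b0N</Summary></DIF>"
root = parse_fragment(sample)
print(repr(root.find("Summary").text))
```

This way the parser never has to guess: the bytes are UTF-8 and the 
declaration says so, whatever encoding the original document used.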

(Sorry, I'm a bit unicode illiterate, and while I appreciate it's something I 
should know, there is other stuff filling my mind at the moment ...)

Bryan

