Idempotent XML processing

Fri Aug 19 14:20:38 EDT 2005

Michael Ekstrand wrote:
> Hello all,
> 
> In my current project, I am working with XML data in a protocol that has
> checksum/signature verification of a portion of the document. There is 
> an envelope with a header element, containing signature data; following 
> the header is a body. The signatures are computed as cryptographic 
> checksums of the entire Body element, including start and end tags, 
> exactly as it appears in the data transmission.
> 
> Therefore, I need to extract the entire text of an element of an XML 
> document. I have a function that scans an XML string and does this, but 
> it seems like a rather clumsy way to accomplish this task. I've been 
> playing with xml.dom.minidom and its toxml() method, but to no avail - 
> the server sends me XML with empty elements as full open/close tags, 
> but toxml() serializes them to the XML empty element (<Element/>), so 
> the checksum winds up not matching.
> 
> Is there some parsing mechanism (using PyXML or any other freely usable 
> 3rd party library is an option) that will allow me to accomplish this? 
> Or am I best off sticking with my little string scanning function?

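Read up on XML canonicalization (abbreviated as c14n). lxml implements 
it, as does xml.dom.ext.c14n in PyXML. The canonical form always writes 
an empty element as a start/end tag pair, so the <Element/> versus 
<Element></Element> difference goes away. You'll need to canonicalize 
on both ends before hashing.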
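
A minimal sketch of the idea, assuming a Python (3.8 or later) whose 
standard library provides xml.etree.ElementTree.canonicalize() (C14N 
2.0). The function name body_digest, the Envelope/Header/Body/Item tag 
names, and the choice of SHA-256 are illustrative assumptions, not part 
of the protocol described above. It only shows that two serializations 
of the same body hash identically once canonicalized.

```python
import hashlib
import xml.etree.ElementTree as ET


def body_digest(envelope_xml, body_tag="Body"):
    """Return a hex digest of the canonical form of one child element.

    body_tag is a hypothetical element name; adjust it to the protocol.
    """
    root = ET.fromstring(envelope_xml)
    body = root.find(body_tag)
    if body is None:
        raise ValueError("no <%s> element found" % body_tag)

    # tostring() would also emit any text that follows the element,
    # so drop the tail before serializing the subtree on its own.
    body.tail = None
    serialized = ET.tostring(body, encoding="unicode")

    # canonicalize() returns the C14N 2.0 form as a string; among other
    # things it writes empty elements as explicit start/end tag pairs.
    canonical = ET.canonicalize(serialized)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same document, two spellings of the empty element.
LONG_FORM = "<Envelope><Header/><Body><Item></Item></Body></Envelope>"
SHORT_FORM = "<Envelope><Header/><Body><Item/></Body></Envelope>"

if __name__ == "__main__":
    print(body_digest(LONG_FORM) == body_digest(SHORT_FORM))
```

This only helps if the sender hashes the canonical form too. Also note 
that ElementTree may rewrite namespace prefixes when it serializes a 
subtree, so for a namespaced protocol a library that canonicalizes the 
subtree in place (lxml, for instance) is likely the safer choice.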

To paraphrase an Old Master, if you are running a cryptographic hash 
over a non-canonical XML string representation, then you are living in a 
state of sin.

-- 
Robert Kern
rkern at ucsd.edu

"In the fields of hell where the grass grows high
  Are the graves of dreams allowed to die."
   -- Richard Harter