Splitting a DOM

Alan Kennedy alanmk at hotmail.com
Thu Feb 12 13:53:38 EST 2004


[Brice Vissi?re]
> But I'm pretty sure that there's a better way to split the DOM ?

There's *lots* of ways to solve this one. The "best" solution depends
on which criteria you choose.

The most efficient in time and memory is probably SAX, although the
problem is so simple, a simple textual solution might work well, and
would definitely be faster.

Here's a bit of SAX code adapted from another SAX example I posted
earlier today. Note that this will not work properly if you have
<STEP> elements nested inside one another. In that case, you'd have to
maintain a stack of the output files: push the outfile onto the stack
in "startElement()" and pop it off in "endElement()".

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
import xml.sax
from xml.sax.saxutils import escape, quoteattr
import cStringIO as StringIO

split_on_elems = ['STEP']

class splitter(xml.sax.handler.ContentHandler):

  def __init__(self):
    xml.sax.handler.ContentHandler.__init__(self)
    self.outfile = None
    self.seq_no = self.seq_no_gen()

  def seq_no_gen(self, n=0):
    while True: yield n ; n = n+1

  def startElement(self, elemname, attrs):
    if elemname in split_on_elems:
      self.outfile = open('step%04d.xml' % self.seq_no.next(), 'wt')
    if self.outfile:
      attrstr = ""
      for a in attrs.keys():
        attrstr = "%s%s" % (attrstr, " %s=%s" % (a,
quoteattr(attrs[a])))
      self.outfile.write("<%s%s>" % (elemname, attrstr))

  def endElement(self, elemname):
    if self.outfile: self.outfile.write('</%s>' % elemname)
    if elemname in split_on_elems:
      self.outfile.close() ; self.outfile = None

  def characters(self, s):
    if self.outfile: self.outfile.write("%s" % (s,))

testdoc = """
<ROOT>
  <STEP a="b" c="d">Step 0</STEP>
  <STEP>Step 1</STEP>
  <STEP>Step 2</STEP>
  <STEP>Step 3</STEP>
  <STEP>Step 4</STEP>
</ROOT>
"""

if __name__ == "__main__":
  parser = xml.sax.make_parser()
  PFJ = splitter()
  parser.setContentHandler(PFJ)
  parser.setFeature(xml.sax.handler.feature_namespaces, 0)
  parser.feed(testdoc)
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

HTH,

-- 
alan kennedy
------------------------------------------------------
check http headers here: http://xhaus.com/headers
email alan:              http://xhaus.com/contact/alan



More information about the Python-list mailing list