[XML-SIG] Reading characters

Albert Chin xml-sig@python.org
Tue, 21 May 2002 18:33:38 -0500


On Wed, May 22, 2002 at 12:07:08AM +0200, Martin v. Loewis wrote:
> Albert Chin <xml-sig@thewrittenword.com> writes:
> 
> > I have an XML element that contains a lot of data (it's a base64
> > encoded file). Reading the characters through the default characters()
> > function is slow (one line at a time). How can I read more?
> > 
> > I'm using PyXML 0.7.1 and Python 2.2.1.
> 
> Can you elaborate? What kind of parsing technology do you use? Expat,
> SAX, minidom, something else?

I'm using expat, to the best of my knowledge:
  from xml.sax import saxexts, saxlib

  ...

  fh = open (self.path, 'r')
  xmlh = read_pkg_db (self.data)
  p = saxexts.make_parser ()
  p.setDocumentHandler (xmlh)
  p.parseFile (fh)
  fh.close ()
  p.close ()

> What do you mean by "read through"? Can you share a bit of code?

  class read_pkg_db (saxlib.HandlerBase):
    def __init__ (self, data, extract = 0):
      self.in_data = 0
      self.data = data
      self.extract = extract      # whether or not to extract payload

    ...

    def startElement (self, name, attrs):

    ...

    def characters (self, ch, start, length):
      if self.in_data and self.extract:
        self.payload = self.payload + ch[start:start+length]

    ...

Sample XML file:

<?xml version="1.0"?>
<packages>
  <package name="m4" version="1.4">
    <package-manager name="pkgadd">
      <datetime type="create">2002-05-21T23:22:05Z</datetime>
      <install-name>m4</install-name>
      <pkgname-base>TWWm4</pkgname-base>
      <version>1.4</version>
      <revision>5</revision>
      <subpkg type="man">m</subpkg>
      <subpkg type="runtime">u</subpkg>
      <data checksum="445f09c2f585c9d5a5c7c8dff2ea8275"
        checksum-type="md5" encoding="base64"
        filename="data.gcpio.bz2" size="104123">
QlpoOTFBWSZTWaGX1TsBL+V/////////////////////////////////////////////4fZgDz61
ABSQpQAAbb1tvuilfS9gyp31kPqW9nX3b7Z8RSmb0zz3wAEQkAeA+xhAXIADtgKp9OgABoAAFAoA
...
      </data>
    </package-manager>
  </package>
</packages>
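
For reference, one common way to handle a large text payload in a SAX handler is to append each chunk to a list and join the pieces once the element closes, rather than growing a string on every characters() call. The following is only a minimal sketch under some assumptions: it uses the standard-library xml.sax.ContentHandler, whose characters() takes a single argument, instead of the three-argument saxlib.HandlerBase form shown above, and the names PayloadHandler and extract_payload are hypothetical. Whether this removes the slowdown in the setup described here has not been measured.

```python
import base64
import xml.sax


class PayloadHandler(xml.sax.ContentHandler):
    """Collects the text of a <data> element (hypothetical handler name)."""

    def __init__(self):
        super().__init__()
        self.in_data = False
        self.chunks = []      # pieces of text delivered by characters()
        self.payload = None   # decoded bytes, set when </data> is seen

    def startElement(self, name, attrs):
        if name == "data":
            self.in_data = True
            self.chunks = []

    def characters(self, content):
        # The parser decides how much text each call carries; just store it.
        if self.in_data:
            self.chunks.append(content)

    def endElement(self, name):
        if name == "data":
            self.in_data = False
            # A single join instead of repeated string concatenation.
            encoded = "".join(self.chunks)
            # b64decode skips newlines and indentation by default.
            self.payload = base64.b64decode(encoded)


def extract_payload(xml_bytes):
    """Parse an XML document given as bytes and return the decoded payload."""
    handler = PayloadHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.payload
```

The parser still calls characters() as often as it likes; the sketch only changes how the pieces are accumulated on the handler side.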

-- 
albert chin (china@thewrittenword.com)