xml.parsers.expat loading xml into a dict and whitespace

kaens apatheticagnostic at gmail.com
Wed May 23 02:15:00 EDT 2007


Ok, I can fix it by modifying

    if self.inOptions and self.curTag != "options":

to

    if self.inOptions and self.curTag != "options" and self.curTag != "":

but this feels really freaking ugly.
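
The extra check works because expat reports the newlines between elements as character data too, and they arrive while curTag is "". A minimal sketch that just records every chunk handed to the character data handler (the `XML` and `chunks` names are made up for this illustration; the sample is the small options file quoted below):

```python
import xml.parsers.expat

# Same shape as the small test file: no indentation, only newlines between tags.
XML = "<options>\n<one>hey</one>\n<two>bee</two>\n<three>eff</three>\n</options>"

chunks = []  # every string expat passes to the character data handler

parser = xml.parsers.expat.ParserCreate()
parser.CharacterDataHandler = chunks.append
parser.Parse(XML, True)

# The newlines between elements show up alongside "hey", "bee" and "eff".
print([repr(c) for c in chunks])
```

So a file with "no whitespace" still has newline text nodes between the tags, and those are what end up under the "" key.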

Sigh.

Any suggestions? I know I must be missing something.
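
One possibly less ugly option, as a sketch rather than a definitive fix: collect text only while a child tag is open, and store it (stripped) in the end-element handler. It keeps the handler names from the quoted code below; the `OptionsParser` class, the `buffer` attribute and the `parse` method are made up here, and it assumes the flat layout described in the original message. As far as I can tell, comments never reach the character data handler unless a comment or default handler is set, so they need no special casing.

```python
import xml.parsers.expat


class OptionsParser:
    """Collects the children of <options> into a dict, skipping stray whitespace."""

    def __init__(self):
        self.options = {}
        self.inOptions = False
        self.curTag = ""
        self.buffer = []  # hypothetical helper: text pieces for the current tag

    def handleStartElement(self, name, attributes):
        if name == "options":
            self.inOptions = True
        elif self.inOptions:
            self.curTag = name
            self.buffer = []

    def handleCharacterData(self, data):
        # Text between elements arrives while curTag is "", so it is dropped here.
        if self.curTag:
            self.buffer.append(data)

    def handleEndElement(self, name):
        if name == "options":
            self.inOptions = False
        elif self.curTag:
            # expat may deliver one element's text in several pieces, so join them.
            self.options[self.curTag] = "".join(self.buffer).strip()
        self.curTag = ""

    def parse(self, text):
        parser = xml.parsers.expat.ParserCreate()
        parser.StartElementHandler = self.handleStartElement
        parser.EndElementHandler = self.handleEndElement
        parser.CharacterDataHandler = self.handleCharacterData
        parser.Parse(text, True)
        return self.options


print(OptionsParser().parse(
    "<options>\n<one>hey</one>\n<two>bee</two>\n<three>eff</three>\n</options>"
))
```

Joining the pieces also avoids a second problem with `self.options[self.curTag] = data`: if the parser splits one element's text across several calls, plain assignment keeps only the last piece.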

Also, I hate the tendency I have to figure stuff out shortly after
posting to a mailing list or forum. Happens all the time, and I swear
I don't solve stuff until I ask for help.

On 5/23/07, kaens <apatheticagnostic at gmail.com> wrote:
> Wait... it's because curTag is set to "" in the end-element handler,
> so the whitespace after a closing tag gets stored under the "" key of
> the dict.
>
> That doesn't explain why it happens on an XML file containing no
> whitespace, unless it's counting newlines.
>
> Is there a way to just ignore whitespace and/or XML comments?
>
> On 5/23/07, kaens <apatheticagnostic at gmail.com> wrote:
> > Hey everyone, this may be a stupid question, but I noticed the
> > following, and as I'm pretty new to using XML and Python, I was
> > wondering if I could get an explanation.
> >
> > Let's say I write a simple XML parser that just loads the content of
> > each tag into a dict (the XML file doesn't have multiple hierarchies
> > in it; it's flat other than the parent node).
> >
> > so we have
> > <parent>
> >      <option1>foo</option1>
> >      <option2>bar</option2>
> >       . . .
> > </parent>
> >
> > (I'm using xml.parsers.expat.)
> > The parser sets a flag that says it's in the parent, and records the
> > name of the tag it's currently processing in the start-element handler.
> > The character data handler sets a dictionary value like so:
> >
> > dictName[curTag] = data
> >
> > After I'm done processing the file, I print out the dict, and the first entry is
> > <a few bits of whitespace> : <a whole bunch of whitespace>
> >
> > There are comments in the XML file - is this what is causing this?
> > There are also blank lines... but I don't see how a blank line would
> > be interpreted as a tag. Comments though, I could see that happening.
> >
> > Actually, I just did a test on an XML file that had no comments or
> > indentation and got the same behaviour.
> >
> > If I feed it the following xml file:
> >
> > <options>
> > <one>hey</one>
> > <two>bee</two>
> > <three>eff</three>
> > </options>
> >
> > it prints out:
> > " :
> >
> > three :  eff
> > two :  bee
> > one :  hey"
> >
> > wtf.
> >
> > For reference, here's the handler functions:
> >
> > def handleCharacterData(self, data):
> >     if self.inOptions and self.curTag != "options":
> >         self.options[self.curTag] = data
> >
> > def handleStartElement(self, name, attributes):
> >     if name == "options":
> >         self.inOptions = True
> >     if self.inOptions:
> >         self.curTag = name
> >
> >
> > def handleEndElement(self, name):
> >     if name == "options":
> >         self.inOptions = False
> >     self.curTag = ""
> >
> > Sorry if the whitespace in the code got mangled (fingers crossed...)
> >
>


