xml.parsers.expat loading xml into a dict and whitespace

kaens apatheticagnostic at gmail.com
Wed May 23 02:07:03 EDT 2007


Wait... it's because curTag is reset to "" in the end element handler,
so the whitespace after each closing tag gets stored under the "" key
of the dict.

That doesn't explain why it happens on an xml file containing no
whitespace, unless it's counting newlines.
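
A quick way to check the newline theory is to record every string expat
hands to the character data handler for the test file. This is only a
sketch: character_chunks and DOC are names made up for the example, and
how the text is split into chunks is up to expat.

```python
import xml.parsers.expat

# The test file from the original post, one element per line.
DOC = """<options>
<one>hey</one>
<two>bee</two>
<three>eff</three>
</options>
"""

def character_chunks(xml_text):
    """Return every string expat passes to the character data handler."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # The handler receives one string per call; append it as-is.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_text, True)
    return chunks

# repr() makes whitespace-only chunks such as '\n' visible.
for chunk in character_chunks(DOC):
    print(repr(chunk))
```

The line breaks between the tags show up as character data of their own,
so a file with no indentation and no blank lines still produces
whitespace-only calls to the handler.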

Is there a way to just ignore whitespace and/or xml comments?
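
One way to ignore the whitespace is to collect character data only while
a child element is open, and to store the joined text in the end element
handler rather than in the character data handler. Comments shouldn't
need special handling, since expat only reports them through
CommentHandler when one is set. A minimal sketch, where load_options and
its root parameter are names made up for this example:

```python
import xml.parsers.expat

def load_options(xml_text, root="options"):
    """Return {tag: text} for the children of a flat <root> element."""
    options = {}
    state = {"in_root": False, "tag": None, "parts": []}

    def start(name, attrs):
        if name == root:
            state["in_root"] = True
        elif state["in_root"]:
            # A child element opened: start collecting its text.
            state["tag"] = name
            state["parts"] = []

    def chars(data):
        # Whitespace between elements arrives here too, but no child
        # element is open at that point, so it is skipped.
        if state["tag"] is not None:
            state["parts"].append(data)

    def end(name):
        if name == root:
            state["in_root"] = False
        elif state["tag"] == name:
            # Join the chunks and trim surrounding whitespace.
            options[name] = "".join(state["parts"]).strip()
            state["tag"] = None

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)
    return options

sample = """<options>
<!-- a comment -->
<one>hey</one>
  <two>bee</two>
<three>eff</three>
</options>"""
print(load_options(sample))
```

Joining the chunks in the end element handler also covers the case where
expat delivers the text of a single element in more than one call.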

On 5/23/07, kaens <apatheticagnostic at gmail.com> wrote:
> Hey everyone, this may be a stupid question, but I noticed the
> following and as I'm pretty new to using xml and python, I was
> wondering if I could get an explanation.
>
> Let's say I write a simple xml parser that just loads the content of
> each tag in an xml file into a dict (the xml file doesn't have
> multiple hierarchies in it, it's flat other than the parent node)
>
> so we have
> <parent>
>      <option1>foo</option1>
>      <option2>bar</option2>
>       . . .
> </parent>
>
> (I'm using xml.parsers.expat)
> the parser sets a flag that says it's in the parent, and sets the
> value of the current tag it's processing in the start tag handler.
> The character data handler sets a dictionary value like so:
>
> dictName[curTag] = data
>
> after I'm done processing the file, I print out the dict, and the first value is
> <a few bits of whitespace> : <a whole bunch of whitespace>
>
> There are comments in the xml file - is this what is causing this?
> There are also blank lines... but I don't see how a blank line would
> be interpreted as a tag. Comments though, I could see that happening.
>
> Actually, I just did a test on an xml file that had no comments or
> whitespace and got the same behaviour.
>
> If I feed it the following xml file:
>
> <options>
> <one>hey</one>
> <two>bee</two>
> <three>eff</three>
> </options>
>
> it prints out:
> " :
>
> three :  eff
> two :  bee
> one :  hey"
>
> wtf.
>
> For reference, here's the handler functions:
>
> def handleCharacterData(self, data):
>     if self.inOptions and self.curTag != "options":
>         self.options[self.curTag] = data
>
> def handleStartElement(self, name, attributes):
>     if name == "options":
>         self.inOptions = True
>     if self.inOptions:
>         self.curTag = name
>
> def handleEndElement(self, name):
>     if name == "options":
>         self.inOptions = False
>     self.curTag = ""
>
> Sorry if the whitespace in the code got mangled (fingers crossed...)
>


