[Tutor] xml parsing from xml

Wed May 7 21:49:46 CEST 2014

Neil D. Cerutti, 07.05.2014 20:04:
> On 5/7/2014 1:39 PM, Alan Gauld wrote:
>> On 07/05/14 17:56, Stefan Behnel wrote:
>>> Alan Gauld, 07.05.2014 18:11:
>>>> and ElementTree (aka etree). The documentation gives examples of both.
>>>> sax is easiest and fastest for simple XML in big files ...
>>>
>>> I wouldn't say that SAX qualifies as "easiest". Sure, if the task is
>>> something like "count number of abc tags" or "find tag xyz and get an
>>> attribute value from it", then SAX is relatively easy and also quite
>>> fast.
>>
>> That's pretty much what I said. simple task, big file. sax is easy.
>>
>> For anything else use etree.
>>
>>> BTW, ElementTree also has a SAX-like parsing mode, but comes with a
>>> simpler interface and saner parser configuration defaults.
>>
>> My experience was different. Etree is powerful but for simple
>> tasks I just found sax easier to grok. (And most of my XML parsing
>> is limited to simple extraction of a field or two.)
> 
> If I understand this task correctly it seems like a good application for
> SAX. As a state machine it could have a mere two states, assuming we aren't
> troubled about the parent nodes of Country tags.

Yep, that's the kind of thing I meant. You get started, just trying to get
one little field out of the file, then notice that you need another
one, and eventually end up writing a page full of code where a couple of
lines would have done the job. Even just safely and correctly getting the
text content of an element is surprisingly non-trivial in SAX.
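
To illustrate the last point, here is a minimal sketch, not anything the OP
posted. The "Country" tag is taken from this thread; the sample document and
the names CountryHandler, sax_country_names and et_country_names are made up
for illustration. The SAX handler has to buffer text itself because
characters() may be called several times for a single text node.

```python
import xml.sax
import xml.etree.ElementTree as ET

# Hypothetical sample input; the character reference makes it likely
# that a parser delivers the text in more than one chunk.
SAMPLE = b"<root><Country>Cura&#231;ao</Country><Country>Peru</Country></root>"


class CountryHandler(xml.sax.ContentHandler):
    """Two-state machine: either inside a Country element or not."""

    def __init__(self):
        super().__init__()
        self.countries = []
        self._buf = None  # None means "not inside a Country element"

    def startElement(self, name, attrs):
        if name == "Country":
            self._buf = []

    def characters(self, content):
        # May be called more than once per text node, so collect chunks.
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == "Country" and self._buf is not None:
            self.countries.append("".join(self._buf))
            self._buf = None


def sax_country_names(xml_bytes):
    handler = CountryHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.countries


def et_country_names(xml_bytes):
    # The ElementTree version of the same extraction.
    root = ET.fromstring(xml_bytes)
    return ["".join(el.itertext()) for el in root.iter("Country")]
```

Both functions should return the same list for the sample above; the
difference is only in how much state you have to manage by hand.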

It's still unclear what the OP wanted exactly, though. To me, it read more
like the task was to copy some content over from one XML file to another,
in which case doing it in ET is just trivial thanks to the tree API, but
SAX requires you to reconstruct the XML brick by brick here.
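
If that reading is right, the ET side could look roughly like the sketch
below. Both documents, the "Countries" container element and the helper
name copy_elements are assumptions of mine, since the OP's actual files
aren't known.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical source and target documents.
SOURCE = "<src><Country code='PE'>Peru</Country><City>Lima</City></src>"
TARGET = "<dst><Countries/></dst>"


def copy_elements(source_root, target_parent, tag):
    """Append a deep copy of every `tag` element found in source_root."""
    for el in source_root.iter(tag):
        # Copy so that the source tree keeps its own elements.
        target_parent.append(copy.deepcopy(el))


src = ET.fromstring(SOURCE)
dst = ET.fromstring(TARGET)
copy_elements(src, dst.find("Countries"), "Country")

# Serialise the modified target tree back to text.
result = ET.tostring(dst, encoding="unicode")
```

The copied element keeps its attributes and text, which is exactly the
part you would have to rebuild event by event with SAX.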

> In my own personal case, I partly prefer xml.sax simply because it ignores
> namespaces, a nice benefit in my cases. I wish I could make ElementTree do
> that.

The downside of namespace-unaware parsing is that you never know what you
get. Prefixes are arbitrary, so code that matches on the raw tag name works
for some input, but it may also just fail for equally valid input that
merely spells the prefix differently.
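
A small sketch of what I mean, assuming xml.sax in its default mode with
namespace processing switched off. The namespace URI, the three documents
and the names TagCounter and count_tag are invented for the example. All
three documents describe the same namespaced elements, but only one of
them matches a handler that looks for a fixed prefixed name.

```python
import xml.sax

# Three equivalent documents: same namespace, different prefix choices.
DOC_A = b'<geo:root xmlns:geo="urn:example:geo"><geo:Country/></geo:root>'
DOC_B = b'<g:root xmlns:g="urn:example:geo"><g:Country/></g:root>'
DOC_C = b'<root xmlns="urn:example:geo"><Country/></root>'


class TagCounter(xml.sax.ContentHandler):
    """Counts start tags whose raw (prefixed) name equals `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.count = 0

    def startElement(self, name, attrs):
        # Without namespace processing, `name` is the name as written
        # in the document, prefix included.
        if name == self.wanted:
            self.count += 1


def count_tag(xml_bytes, wanted):
    handler = TagCounter(wanted)
    xml.sax.parseString(xml_bytes, handler)
    return handler.count
```

Here count_tag(DOC_A, "geo:Country") finds the element, while the same
call on DOC_B or DOC_C finds nothing, although nothing is wrong with
either document.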

One cool thing about ET is that it makes namespace-aware processing easy by
using fully qualified tag names (one string says it all). Most other XML
tools (including SAX) require some annoying prefix mapping setup that you
have to carry around in order to tell the processor that you are really
talking about the thing that it's showing to you.
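
For example, with ET's "{namespace-uri}localname" notation, a lookup
along these lines should work regardless of the prefix a document happens
to use. The URI, the sample documents and the function name
count_countries are made up for the sketch.

```python
import xml.etree.ElementTree as ET

GEO = "urn:example:geo"  # hypothetical namespace URI

# The same content written with three different prefix choices.
DOCS = [
    '<geo:root xmlns:geo="urn:example:geo"><geo:Country/></geo:root>',
    '<g:root xmlns:g="urn:example:geo"><g:Country/></g:root>',
    '<root xmlns="urn:example:geo"><Country/></root>',
]


def count_countries(xml_text):
    root = ET.fromstring(xml_text)
    # One string carries both the namespace and the local name.
    qualified = "{%s}Country" % GEO
    return sum(1 for _ in root.iter(qualified))


counts = [count_countries(doc) for doc in DOCS]
```

No prefix mapping is passed around anywhere; the qualified name alone
identifies the element in all three documents.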

Stefan