Trying to parse a HUGE(1gb) xml file

Roy Smith roy at panix.com
Mon Dec 27 18:21:32 EST 2010


Alan Meyer <ameyer2 at yahoo.com> wrote:

> On 12/26/2010 3:15 PM, Tim Harig wrote:
> I agree with you but, as you say, it has become a de facto standard.  As 
> a result, we often need to use it unless there is some strong reason to 
> use something else.

This is certainly true.  In the rarefied world of Usenet, we can all 
bash XML (and I'm certainly front and center of the XML bashing crowd).  
In the real world, however, it's a necessary evil.  Knowing how to work 
with it (at least to some extent) should be in every software engineer's 
bag of tricks.

> The same thing can be said about relational databases.  There are 
> applications for which a hierarchical database makes more sense, is more 
> efficient, and is easier to understand.  But anyone who recommends a 
> database that is not relational had better be prepared to defend his 
> choice with some powerful reasoning because his management, his 
> customers, and the other programmers on his team are probably going to 
> need a LOT of convincing.

This is also true.  In the old days, they used to say, "Nobody ever got 
fired for buying IBM".  Relational databases have pretty much gotten to 
that point.  Suits are comfortable with Oracle and MS SQL Server, and 
even MySQL.  If you want to go NoSQL, the onus will be on you to 
demonstrate that it's the right choice.

Sometimes, even when it is the right choice, it's the wrong choice.  You 
typically have a limited amount of influence capital to spend, and many 
battles to fight.  Sometimes it's right to go along with SQL, even if 
you know it's wrong from a technology point of view, simply because 
taking the easy way out on that battle may let you devote the energy you 
need to win more important battles.

And, anyway, when your SQL database becomes the bottleneck, you can 
always go back and say, "I told you so".  Trust me, if you're ever 
involved in an "I told you so" moment, you really want to be on the 
transmitting end.

> And of course there are many applications where XML really is the best. 
> It excels at representing complex textual documents while still 
> allowing programmatic access to individual items of information.

Yup.  For stuff like that, there really is no better alternative.  To go 
back to my earlier example of

        <Parental-Advisory>FALSE</Parental-Advisory>

using 432 bits to store 1 bit of information, stuff like that doesn't 
happen in marked-up text documents.  Most of the file is CDATA (do they 
still use that term in XML, or was that an SGML-ism only?).  The markup 
is a relatively small fraction of the data.  I'm happy to pay a factor 
of 2 or 3 to get structured text that can be machine processed in useful 
ways.  I'm not willing to pay a factor of 432 to get tabular data when 
there are plenty of other, much more reasonable ways to encode it.
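
To put a number on that overhead, here is a minimal Python sketch that 
counts the bits in the element above.  It assumes one byte per character 
(ASCII or UTF-8).  The 8 spaces of indentation and the CRLF line ending 
are guesses at how the line sat in the original file, not something 
stated in the thread, and the helper name bits_used is made up for 
illustration.

```python
# Rough overhead estimate for the element quoted above.
# Assumption: one byte (8 bits) per character, i.e. ASCII or UTF-8 text.
element = "<Parental-Advisory>FALSE</Parental-Advisory>"


def bits_used(text: str) -> int:
    """Return the number of bits the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8")) * 8


# The element on its own, with no surrounding whitespace.
bare = bits_used(element)

# Hypothetical layout: 8 spaces of indentation plus a CRLF line ending.
indented = bits_used(" " * 8 + element + "\r\n")

print(bare, indented)
```

The bare element comes to 352 bits.  Adding the assumed indentation and 
line ending gives 432, which matches the figure quoted, though other 
layouts could produce the same total.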


