Trying to parse a HUGE(1gb) xml file

Tue Dec 28 06:56:31 EST 2010

On Tue, 2010-12-28 at 07:08 +0100, Stefan Behnel wrote:
> Roy Smith, 28.12.2010 00:21:
> > To go back to my earlier example of
> >          <Parental-Advisory>FALSE</Parental-Advisory>
> > using 432 bits to store 1 bit of information, stuff like that doesn't
> > happen in marked-up text documents.  Most of the file is CDATA (do they
> > still use that term in XML, or was that an SGML-ism only?).  The markup
> > is a relatively small fraction of the data.  I'm happy to pay a factor
> > of 2 or 3 to get structured text that can be machine processed in useful
> > ways.  I'm not willing to pay a factor of 432 to get tabular data when
> > there's plenty of other much more reasonable ways to encode it.
> If the above only appears once in a large document, I don't care how much 
> space it takes. If it appears all over the place, it will compress down to 
> a couple of bits, so I don't care about the space, either.

+1
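
As a rough illustration of the compression point, here is a minimal
Python sketch using the standard zlib module.  It repeats the quoted
element verbatim and reports the average number of compressed bits spent
per occurrence.  The helper name and the repetition counts are made up
for illustration, and identical back-to-back records are the best case;
a real document with other content between occurrences will compress
less well.

```python
import zlib

# The element from the quoted example, one occurrence per line.
RECORD = b"<Parental-Advisory>FALSE</Parental-Advisory>\n"


def compressed_bits_per_record(count):
    """Average compressed size, in bits, of one record out of `count` copies.

    Hypothetical helper for illustration only.
    """
    raw = RECORD * count
    packed = zlib.compress(raw, 9)  # 9 = best compression
    return len(packed) * 8 / count


if __name__ == "__main__":
    print("raw bits per record:", len(RECORD) * 8)
    for count in (1, 1000, 100000):
        bits = compressed_bits_per_record(count)
        print(count, "records:", round(bits, 2), "compressed bits each")
```

With a single occurrence the compressed form is no smaller to speak of,
which matches the "only appears once, don't care" half of the argument.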

> It's readability that counts here. Try to reverse engineer a binary format 
> that stores the above information in 1 bit.

I think a point many of the arguments against XML miss is the HR cost of
custom solutions.  Every time you come up with a cool super-efficient
solution it has to be weighed against the increase in the tool-stack
[whereas XML is, essentially, built-in] and
nobody-else-knows-about-your-super-cool-solution [1].  IMO, tool-stack
bloat is a *big* problem in shops with an Open Source tendency: they are
always tossing the new and shiny thing [it's free!] into the bucket for
some theoretical benefit.  [This is an unrecognized benefit of expensive
software - it creates focus.]  Soon the bucket is huge and maintaining
it becomes a burden.

[1] The odds that you sufficiently documented your super-cool-solution
are probably nil.

So I'm one of those people to whom you'd have to make a *really* good
argument *not* to use XML.  XML is known, the tools are good, and the
knotty problems are solved [thanks to the likes of SAX, lxml /
ElementTree, and ElementFlow].  If the main argument is "bloat" I'd
probably dismiss it out of hand, since removing that bloat will
necessitate adding bloat somewhere else, and that somewhere else will
almost certainly be more expensive.
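
For the original question of a file too big to load whole, here is a
minimal sketch of the streaming approach with the standard library's
ElementTree.  The document layout (a <catalog> root holding <item>
records) and the function name are assumptions made up for the example;
only the Parental-Advisory element comes from this thread.  The idea is
to handle each record as its end tag arrives and then throw it away, so
memory use does not grow with file size.

```python
import io
import xml.etree.ElementTree as ET


def count_flagged(source, record_tag="item", flag_tag="Parental-Advisory"):
    """Count records whose flag element reads TRUE, without building the full tree.

    `record_tag` and the overall layout are hypothetical; adjust to the real file.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    flagged = 0
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            if elem.findtext(flag_tag, default="").strip().upper() == "TRUE":
                flagged += 1
            root.clear()  # drop finished records so they can be freed
    return flagged


# Tiny in-memory stand-in for the 1 GB file; a filename works the same way.
SAMPLE = (
    b"<catalog>"
    b"<item><Parental-Advisory>FALSE</Parental-Advisory></item>"
    b"<item><Parental-Advisory>TRUE</Parental-Advisory></item>"
    b"<item><Parental-Advisory>TRUE</Parental-Advisory></item>"
    b"</catalog>"
)

if __name__ == "__main__":
    print(count_flagged(io.BytesIO(SAMPLE)))
```

lxml offers an iterparse with a similar shape, and SAX does the same job
with callbacks instead of elements; which one fits best depends on how
much structure you need around each record.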