Trying to parse a HUGE (1 GB) XML file

Stefan Sonnenberg-Carstens stefan.sonnenberg at pythonmeister.com
Sat Dec 25 15:51:16 EST 2010


On 25.12.2010 20:41, Roy Smith wrote:
> In article <mailman.285.1293297695.6505.python-list at python.org>,
>  Adam Tauno Williams <awilliam at whitemice.org> wrote:
>
>> XML works extremely well for large datasets.
> Barf.  I'll agree that there are some nice points to XML.  It is
> portable.  It is (to a certain extent) human readable, and in a pinch
> you can use standard text tools to do ad-hoc queries (i.e. grep for a
> particular entry).  And, yes, there are plenty of toolsets for dealing
> with XML files.
>
> On the other hand, the verbosity is unbelievable.  I'm currently working
> with a data feed we get from a supplier in XML.  Every day we get
> incremental updates of about 10-50 MB each.  The total data set at this
> point is 61 GB.  It's got stuff like this in it:
>
>          <Parental-Advisory>FALSE</Parental-Advisory>
>
> That's 54 bytes to store a single bit of information.  I'm all for
> human-readable formats, but bloating the data by a factor of 432 is
> rather excessive.  Of course, that's an extreme example.  A more
> efficient example would be:
>
>          <Id>1173722</Id>
>
> which is 26 bytes to store an integer.  That's only a bloat factor of
> 6-1/2.
>
> Of course, one advantage of XML is that with so much redundant text, it
> compresses well.  We typically see gzip compression ratios of 20:1.
> But, that just means you can archive them efficiently; you can't do
> anything useful until you unzip them.
That said, sending a complete SQLite database works perfectly well.
For example, Fedora uses (or at least used) this approach for its yum
catalog updates: download the file to the right place, point your tool
at it, and you're done.
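
Querying it afterwards is then a couple of lines of stdlib. Roughly
like this, assuming a yum-style catalog named primary.sqlite with a
packages(name, version) table (adjust for the real schema):

    import sqlite3

    # "primary.sqlite" and the packages(name, version) layout are the
    # assumed yum-style catalog; substitute the actual file and schema.
    con = sqlite3.connect("primary.sqlite")
    for name, version in con.execute(
            "SELECT name, version FROM packages WHERE name LIKE ?",
            ("python%",)):
        print(name, version)
    con.close()

No parsing step at all, and the indexes come along for free.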



