Writing big XML files where beginning depends on end.

Laurent Pointal laurent.pointal at limsi.fr
Mon Nov 28 07:34:30 EST 2005


Magnus Lycka wrote:
> We're using DOM to create XML files that describe fairly
> complex calculations. The XML is structured as a big tree,
> where elements in the beginning have values that depend on
> other values further down in the tree. Imagine something
> like below, but much bigger and much more complex:
> 
> <node sum="15">
>     <node sum="10">
>         <leaf>7</leaf>
>         <node sum="3">
>             <leaf>2</leaf>
>             <leaf>1</leaf>
>         </node>
>     </node>
>     <node sum="5">
>         <leaf>5</leaf>
>     </node>
> </node>
> 
> We have to stick with this XML structure for now.
> 
> In some cases, building up a DOM tree in memory takes up
> several GB of RAM, which is a real showstopper. The actual
> file is maybe an order of magnitude smaller than the DOM
> tree. The app is using libxml2. It's actually written in
> C++. A library with much lower memory overhead might be
> sufficient.
> 
> We've thought of writing a file that looks like this...
> 
> <node sum="#1">
>     <node sum="#1.1">
>         <leaf>7</leaf>
>         <node sum="#1.1.1">
>             <leaf>2</leaf>
>             <leaf>1</leaf>
>         </node>
>     </node>
>     <node sum="#1.2">
>         <leaf>5</leaf>
>     </node>
> </node>
> 
> ...and store {"#1": "15", "#1.1": "10", ... } in a map
> and then read in a piece at a time and perform some
> simple search and replace to get the correct values in.
> Then we need something that allows parts of the XML file
> to be written to file and purged from RAM to avoid the
> memory problem.
> 
> Suggestions for solutions are appreciated.

An idea.

Pad your placeholder names with spaces: <node sum="#1          ">
Store { "#1": <its offset in the result file> }.

When you have collected all your final data, go through your stored #
entries, seek to each one's position in the file, and overwrite the
placeholder (writing exactly as many characters as were reserved).

A kind of direct access to nodes in an XML document file.
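
To make this concrete, here is a rough Python sketch of the idea, not a
tested solution for your C++ app. The names (WIDTH, open_node, patch_sums,
demo) and the 12-character field width are my own assumptions; an
io.BytesIO object stands in for a real file opened in "w+b" mode.

```python
import io

WIDTH = 12  # assumed maximum number of characters a sum can need


def open_node(out, offsets, key):
    """Write an opening <node> tag with a blank, fixed-width sum attribute."""
    out.write(b'<node sum="')
    offsets[key] = out.tell()      # byte offset where the value will go
    out.write(b" " * WIDTH)        # reserve the space
    out.write(b'">\n')


def patch_sums(out, offsets, sums):
    """Seek back to every reserved slot and overwrite it in place."""
    for key, offset in offsets.items():
        text = str(sums[key]).encode("ascii")
        if len(text) > WIDTH:
            raise ValueError("value %r does not fit in %d bytes" % (text, WIDTH))
        out.seek(offset)
        out.write(text.ljust(WIDTH))   # same number of bytes as reserved


def demo():
    out = io.BytesIO()  # stands in for a file opened in "w+b" mode
    offsets = {}
    open_node(out, offsets, "#1")
    open_node(out, offsets, "#1.1")
    out.write(b"<leaf>7</leaf>\n")
    out.write(b"</node>\n")
    open_node(out, offsets, "#1.2")
    out.write(b"<leaf>5</leaf>\n")
    out.write(b"</node>\n")
    out.write(b"</node>\n")
    # The sums are only known once the whole tree has been written.
    patch_sums(out, offsets, {"#1": 12, "#1.1": 7, "#1.2": 5})
    return out.getvalue().decode("ascii")
```

Only the offsets map has to stay in RAM, the tree itself is streamed to
disk. The attribute values keep their trailing spaces, so whatever reads
the file must strip them. If that is a problem, one variation is to write
the closing quote right after the value and put the padding before the
">", since whitespace is allowed there in a start tag.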

A+

Laurent.


