xml + mmap cross

castironpi castironpi at gmail.com
Fri Sep 5 14:43:35 EDT 2008


On Sep 5, 12:13 am, Stefan Behnel <stefan... at behnel.de> wrote:
> Hi,
>
> this discussion seems pretty much off-topic for a Python list.
>
>
>
> castironpi wrote:
> > In an XML file, entries are stored in serial, sort of like this.
>
> > AAA BBB CCC DDD
>
> > Or more recognizably,
>
> > <A><B><C>something</C><D>something</D></B></A>
>
> > Point is, to change <C>something</C> to <C>something else</C>, you
> > have to recopy everything after that.
>
> > AAA BBB CCC DDD
> > AAA BBBb CCC DDD
>
> > requires 7 writes, 'b CCC DDD', not 1.
>
> > I want to use a simple tree structure to store:
>
> > 0 A-> None, 1
> > 1 B-> None, 2
> > 2 C-> 3, None
> > 3 D-> None, None
>
> > Each node maps to 'Next, Child', or more accurately, 'Next Sibling,
> > First Child'.
>
> Do I understand you right: you want to access serialised XML data

alex23 correctly pointed out last night that XML is always
serialized.  What I mean and possibly what you mean is, a serialized
tree structure.

> as a memory
> mapped file and operate directly on the byte data?

Yes.  Byte data containing both strings, and the structure of the
tree.

> You would still have to
> decode the complete byte sequence to parse it and build your index structure
> in that case.

No, it's saved in an index structure; it's already in one.  To find a/
b/c, read a, read its children to b, read b, read its children to c,
read c.

> How do you plan to store the pointers to a node's next sibling/child? And how
> do you keep them updated over insertions/deletions?

You update them when you insert them.  To add 'c' to a/b, allocate c,
initialize c, read a, read its children to b, read b, read its
children to the end, add c.

When you delete c from a/b/c, however, any references to c that you
have in any programs become invalid.  Don't delete it if you have
them.  The bytes are marked in the file to be no longer in use, which
marking takes up some of the space in the file.

> That, plus the copy
> overhead in a sequential file, will be very costly on each change.

No.  That's the point of a dynamic structure.  <abc><def> are not
stored in memory as 10 consecutive characters.  The file is not
strictly speaking XML.  See below for what they're stored as.

> > You get constant time updates to contents, and log-time searches.
>
> Every XML tree structure gives you log-time searches. But how do you achieve
> constant time updates in a sequential file?

You don't use a sequential file.

> Stefan

I stated earlier that each node would look roughly like:
tag_offset, first_attr, text_offset, tail_offset, first_child,
prev_sibling, next_sibling, parent

Attributes would look like:
key_offset, value_offset, prev_attr, next_attr, node

All these fields are integers that contain an offset into the file.

Simplified:

0 Reserved
1 A.Tag 7 (points to 7 below)
2 A.FirstChild 4 (etc.)
3 A.Contents 0 (None)
4 B.Tag 11
5 B.FirstChild 0
6 B.Contents 15
7 3abc
11 3def
15 5ghijk

This translates to:

<abc><def>ghijk</def></abc>

But isn't stored that way.

The records are all the same size.  The 'Tag' field of a record that
starts at offset J is at offset J.  The 'FirstChild' field of a record
that starts at offset K is at offset K+ 1.

'tag_offset', 'text_offset', 'key_offset', and 'value_offset' contain
offsets of variable-length strings into the file ( 7, 11, 15 ).  The
rest contain offsets of further structures.

There's extra space usage not only in the structure of the tree, but
in the record-keeping of what bytes are available for use, and which
are in use.  You have to grow the file size yourself (like growing an
array) when you need to; the file system won't for you in mmap.  (This
means the alloc-free module I'm looking at will need modifications.)

I'll reemphasize the value of constant-time insertions to a file
though.



More information about the Python-list mailing list