xml + mmap cross

Thu Sep 4 23:28:34 EDT 2008

On Sep 4, 7:54 pm, alex23 <wuwe... at gmail.com> wrote:
> On Sep 4, 8:31 am, castironpi <castiro... at gmail.com> wrote:
>
> > Any interest in pursuing/developing/working together on a mmaped-xml
> > class?  Faster, not readable in text editor.
>
> XML is text-based, so it should -always- be readable in a text editor.
> It's part of the definition, I believe.
>
> However, an implementation of one of the alternative binary XML
> formats would probably be very welcome.
>
> Fast Infoset:http://www.itu.int/rec/T-REC-X.891-200505-I/en
> EXI:http://www.w3.org/TR/2007/WD-exi-20070716/
>
> I don't know enough about either format to say if it would be
> possible, but an implementation that conformed to the ElementTree API
> could be a big win.

I was thinking of something much less restrictive than the formats at
those two links.  Since it's not text, I'm not sure it even counts as
structured markup.  Something more generic, like hierarchical
'tag-content-child' records.

Here's what the xml.etree.ElementTree API says:

Each element has a number of properties associated with it:

- a tag which is a string identifying what kind of data this element
represents (the element type, in other words).
- a number of attributes, stored in a Python dictionary.
- a text string.
- an optional tail string.
- a number of child elements, stored in a Python sequence

Since all of these would be buffer-based representations, the
attribute list would merely implement the mapping-object protocol, not
be in a true dictionary.  The strings would be stored as offsets to
length-prefixed buffer segments.
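
To make the "offsets to length-prefixed buffer segments" idea concrete,
here is a rough sketch.  The 4-byte little-endian length prefix, the UTF-8
encoding, the helper names write_string/read_string, and the choice to
leave offset 0 unused (so it can mean "no string") are all my assumptions,
not a settled format.

```python
import mmap
import struct

# Assumed layout: a 4-byte little-endian length, then that many UTF-8 bytes.
LEN = struct.Struct("<I")

def write_string(buf, offset, text):
    """Store text at offset; return the offset just past the stored bytes."""
    data = text.encode("utf-8")
    LEN.pack_into(buf, offset, len(data))
    start = offset + LEN.size
    buf[start:start + len(data)] = data
    return start + len(data)

def read_string(buf, offset):
    """Read back the length-prefixed string stored at offset."""
    (length,) = LEN.unpack_from(buf, offset)
    start = offset + LEN.size
    return bytes(buf[start:start + length]).decode("utf-8")

buf = mmap.mmap(-1, 4096)      # anonymous map; a file-backed map works the same
tag_offset = 16                # offset 0 is kept free to mean "no string"
text_offset = write_string(buf, tag_offset, "b")
end = write_string(buf, text_offset, "abc")
```

A node would then hold only tag_offset and text_offset, and the strings
would be decoded on demand with read_string.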

Each node would look roughly like:
tag_offset, first_attr, text_offset, tail_offset, first_child,
prev_sibling, next_sibling, parent

Attributes would look like:
key_offset, value_offset, prev_attr, next_attr, node

These are all integers representing offsets elsewhere into the map.
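
As a sketch of how those records might be packed and walked: the field
names are the ones listed above, but the 32-bit unsigned field width, the
use of 0 as a null offset, and the helper names put_node, get_node and
children are assumptions of mine.  A bytearray stands in for the map here;
an mmap object should accept the same struct calls.

```python
import struct
from collections import namedtuple

NODE = struct.Struct("<8I")   # eight 32-bit offsets per node (assumed width)
ATTR = struct.Struct("<5I")   # five 32-bit offsets per attribute
NULL = 0                      # assumed sentinel for "no such record"

Node = namedtuple(
    "Node",
    "tag_offset first_attr text_offset tail_offset "
    "first_child prev_sibling next_sibling parent",
)
Attr = namedtuple("Attr", "key_offset value_offset prev_attr next_attr node")

def put_node(buf, offset, node):
    """Write a node record at offset."""
    NODE.pack_into(buf, offset, *node)

def get_node(buf, offset):
    """Read the node record stored at offset."""
    return Node(*NODE.unpack_from(buf, offset))

def children(buf, offset):
    """Yield child offsets by following first_child, then next_sibling."""
    child = get_node(buf, offset).first_child
    while child != NULL:
        yield child
        child = get_node(buf, child).next_sibling

buf = bytearray(256)
root, b1, b2 = 32, 64, 96     # arbitrary non-overlapping record offsets
put_node(buf, root, Node(0, 0, 0, 0, b1, 0, 0, 0))
put_node(buf, b1, Node(0, 0, 0, 0, 0, 0, b2, root))
put_node(buf, b2, Node(0, 0, 0, 0, 0, b1, 0, root))
child_offsets = list(children(buf, root))
```

The attribute chain could be walked the same way through first_attr and
next_attr, which is where a mapping-style wrapper would sit.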

A short observation:

>>> import xml.etree.ElementTree as e
>>> a= e.XML( '<a><b>abc</b></a>' )
>>> a.getchildren()[0].text
'abc'
>>> a.getchildren()[0].text= 'ab<'
>>> e.tostring(a)
'<a><b>ab&lt;</b></a>'
>>> e.XML(_)
<Element a at c2c3f0>
>>> _.getchildren()[0].text
'ab<'

The current implementation round-trips the special character '<' through
its markup form '&lt;', which I propose to support as well.
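
One way this could work, as a sketch only: the map stores the raw
character, and escaping happens just at the boundary with text XML.  The
standard library's xml.sax.saxutils.escape and unescape are used here
purely for illustration; nothing says the mapped class would have to use
them.

```python
from xml.sax.saxutils import escape, unescape

raw = "ab<"                   # what the buffer would hold, unescaped
markup = escape(raw)          # what a text export would write
restored = unescape(markup)   # what a text import would store again
```

With this split, reading .text out of the map never needs any unescaping.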

Of course, you'd have to garbage collect removed nodes by hand, on any
deletions.
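
A free list is one possible way to do that by hand.  In this sketch,
removed node slots are chained together and reused before the map grows.
Keeping the list head at offset 0, reusing a freed record's first field as
the "next free" link, and the names free_node and alloc_node are all
assumptions for illustration.

```python
import struct

NODE = struct.Struct("<8I")   # same assumed 8-field node record
HEAD = struct.Struct("<I")    # free-list head, assumed to live at offset 0
NULL = 0

def free_node(buf, offset):
    """Push a removed node's slot onto the free list."""
    (head,) = HEAD.unpack_from(buf, 0)
    # Zero the record, reusing its first field as the 'next free' link.
    NODE.pack_into(buf, offset, head, 0, 0, 0, 0, 0, 0, 0)
    HEAD.pack_into(buf, 0, offset)

def alloc_node(buf, end):
    """Return (offset, new_end): reuse a freed slot, else grow at end."""
    (head,) = HEAD.unpack_from(buf, 0)
    if head != NULL:
        (next_free,) = HEAD.unpack_from(buf, head)
        HEAD.pack_into(buf, 0, next_free)
        return head, end
    return end, end + NODE.size

buf = bytearray(256)
end = 32                      # first record starts after a small header
a, end = alloc_node(buf, end)
b, end = alloc_node(buf, end)
free_node(buf, a)
c, end = alloc_node(buf, end) # reuses the slot that a occupied
```

Variable-length strings are harder to reclaim than fixed-size records and
would probably need compaction or size-bucketed free lists.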

Also, possibly change the subject to: ElementTree + mmap cross.