XML overuse? (was Re: Python to XML to Python conversion)

Mon Jul 15 14:04:05 EDT 2002

holger krekel <pyth at devel.trillke.net> wrote:
>Huaiyu Zhu wrote:
>> Readability for machines does not have to come at the expense of readability
>> for humans.  A few years back I experimented with an indentation based data
>> format that is:
>> 
>> - as readable as emacs's outline mode
>> - reduce to common conventions like this paragraph for simple cases
>> - allow mixed nested structures of set, sequence, dictionary, and seqdict
>> - can include binary data 
>> - can handle different encodings/encryptions in different elements
>> - with average less than 5% bloat, in contrast to XML's over 100% bloat
>
>do you have any code or design documents for this?  
>
>Sounds quite interesting.

The basic idea is quite simple: consider a data structure as a tree; denote
the type of branching at each node; indent the subtrees.  It appears to me
that indentation is easier to handle than quotes and escapes.  Here's a
simple example:

[]
# This is a sequence
- first item
- second item
  with multiple lines
-{}
 # The third item in the sequence is itself a set
 - element 1
 -## encryption=somescheme
  # element 2 is binary data
   the binary data goes here
   which can be multiple lines as well
 -{:}
  # element 3 is a dictionary
  - key1: value1
  - key2: value2
-[:]
 # The fourth item in the sequence is itself a seqdict
 - key1: value1
 - key2:-
   This value is multiline
   Which keeps the same indentation
   So that it is human readable
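
To make the tree-plus-indentation idea concrete, here is a minimal Python
sketch of a writer for a subset of the notation above.  It is only an
illustration, not the original perl/python code: the names dump, dump_items
and marker are made up for this sketch, it handles only lists and
dictionaries with string leaves, and the indentation widths are simply
copied from the example.

```python
def marker(node):
    """Return the branching-type marker for a container (names are hypothetical)."""
    if isinstance(node, dict):
        return "{:}"
    if isinstance(node, list):
        return "[]"
    raise TypeError("this sketch only handles list and dict containers")


def dump_items(node, depth):
    """Return the lines for the children of a container, indented by depth spaces."""
    pad = " " * depth
    cont = " " * (depth + 2)  # continuation lines sit two columns past the dash
    lines = []
    if isinstance(node, dict):
        for key, value in node.items():
            if not isinstance(value, str):
                # Assumption: the post shows no nested dictionary values.
                raise TypeError("this sketch only handles string values")
            parts = value.split("\n")
            if len(parts) > 1:
                lines.append(f"{pad}- {key}:-")
                lines.extend(cont + part for part in parts)
            else:
                lines.append(f"{pad}- {key}: {value}")
    else:
        for item in node:
            if isinstance(item, str):
                first, *rest = item.split("\n")
                lines.append(f"{pad}- {first}")
                lines.extend(cont + part for part in rest)
            else:
                # Nested container: marker on the item line, subtree one level deeper.
                lines.append(f"{pad}-{marker(item)}")
                lines.extend(dump_items(item, depth + 1))
    return lines


def dump(root):
    """Serialize a nested list/dict structure, starting with the root marker."""
    return "\n".join([marker(root)] + dump_items(root, 0))


print(dump(["first item",
            "second item\nwith multiple lines",
            {"key1": "value1", "key2": "line one\nline two"}]))
```

Sets, seqdicts and binary elements are left out because mapping them to
Python types involves choices the example does not settle.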

There is a complication, which I cannot recall at the moment, that requires
the indentation to be at least two characters.  

At the outermost level, items could be separated by blank lines to make the
file more readable.  So a bibtex type of file would look like

[]{:}

- bibkey: ...
- author: ...
- title: ...

- bibkey: ...
- author: ...
- title: ...
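
A reader for this flat layout can be quite short.  The sketch below assumes
that "[]{:}" means a sequence of dictionaries, that each entry is a single
"- key: value" line, and that blank lines separate the records; the function
name parse_records and the sample values are made up for illustration.

```python
def parse_records(text):
    """Parse a '[]{:}' file into a list of dicts (single-line values only)."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "[]{:}":
        raise ValueError("expected a '[]{:}' header line")
    records, current = [], {}
    for line in lines[1:]:
        if not line.strip():
            # A blank line closes the record collected so far, if any.
            if current:
                records.append(current)
                current = {}
            continue
        if not line.startswith("- "):
            raise ValueError(f"unexpected line: {line!r}")
        key, _, value = line[2:].partition(": ")
        current[key] = value
    if current:
        records.append(current)
    return records


sample = """[]{:}

- bibkey: key1
- author: Some Author
- title: Some Title

- bibkey: key2
- author: Another Author
- title: Another Title
"""
for record in parse_records(sample):
    print(record["bibkey"], "/", record["author"], "/", record["title"])
```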

For deeply nested structures, it is more efficient but less readable to use

0)- a
1)- b
2)- c
3)- d
2)- e

in place of 

- a
 - b
  - c
   - d
  - e
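
The two notations carry the same information, so converting between them is
mechanical.  Here is a small sketch, assuming one item per line and one
space of indentation per level as in the example above (multiline values
would need extra handling); the function names are made up for this post.

```python
def to_numbered(text):
    """Replace leading spaces with an explicit 'depth)' prefix."""
    out = []
    for line in text.splitlines():
        stripped = line.lstrip(" ")
        depth = len(line) - len(stripped)
        out.append(f"{depth}){stripped}")
    return "\n".join(out)


def to_indented(text):
    """Replace a 'depth)' prefix with that many leading spaces."""
    out = []
    for line in text.splitlines():
        depth, _, rest = line.partition(")")
        out.append(" " * int(depth) + rest)
    return "\n".join(out)


indented = "- a\n - b\n  - c\n   - d\n  - e"
numbered = to_numbered(indented)
print(numbered)
print(to_indented(numbered) == indented)
```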

Assuming that the newline character occurs in binary data with a frequency
of 1/256, and that the structural denotations at the beginning of each line
occupy fewer than 10 characters, the bloat factor for binary data would be
less than 5% (at most 10 extra characters per 256 bytes, or about 4%).
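
The estimate can be checked with a few lines of Python.  This is only the
back-of-the-envelope model described above (uniformly random bytes and a
fixed per-line prefix), not a measurement; the function name and its
parameters are made up for illustration.

```python
def binary_bloat(prefix_chars=10, newline_probability=1 / 256):
    """Estimate the relative overhead of per-line prefixes in binary data."""
    # With newlines occurring at random, a line averages
    # 1 / newline_probability bytes (counting the newline itself).
    mean_line_length = 1 / newline_probability
    return prefix_chars / mean_line_length


print(f"{binary_bloat():.2%}")  # 10 / 256, i.e. 3.91%
```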

OK, hope this makes sense.  If this is still interesting I'll dig the thing
out.  I have documents and code (perl and python) at home, but I'll have to
dig through several tar files to find them, maybe on a hard disk that's not
mounted.  This all started back when I tried to use perl to manage my bibtex
files while I did not know Python, so some of them used % and @ to represent
hash and array following perl.  The format itself also changed somewhat over
the years.  So don't expect those to be more organized than this post. :-)
They certainly have more details, though.

Huaiyu