XML overuse? (was Re: Python to XML to Python conversion)
Huaiyu Zhu
huaiyu at gauss.almadan.ibm.com
Mon Jul 15 14:04:05 EDT 2002
holger krekel <pyth at devel.trillke.net> wrote:
>Huaiyu Zhu wrote:
>> Readability for machines does not have to come at the expense of readability
>> for humans. A few years back I experimented with an indentation-based data
>> format that:
>>
>> - is as readable as emacs's outline mode
>> - reduces to common conventions like this paragraph for simple cases
>> - allows mixed nested structures of set, sequence, dictionary, and seqdict
>> - can include binary data
>> - can handle different encodings/encryptions in different elements
>> - averages less than 5% bloat, in contrast to XML's over 100% bloat
>
>Do you have any code or design documents for this?
>
>Sounds quite interesting.
The basic idea is quite simple: consider a data structure as a tree; denote
the type of branching at each node; indent the subtrees. It appears to me
that indentation is easier to handle than quotes and escapes. Here's a
simple example:
[]
  # This is a sequence
  - first item
  - second item
    with multiple lines
  -{}
    # The third item in the sequence is itself a set
    - element 1
    -## encryption=somescheme
      # element 2 is binary data
      the binary data goes here
      which can be multiple lines as well
    -{:}
      # element 3 is a dictionary
      - key1: value1
      - key2: value2
  -[:]
    # The fourth item in the sequence is itself a seqdict
    - key1: value1
    - key2:-
      This value is multiline
      Which keeps the same indentation
      So that it is human readable
There is a complication that I cannot recall at this moment that requires
the indentation to be at least two characters.
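To make the idea concrete, here is a minimal Python sketch of an emitter for
this kind of notation. It is not the original perl or python code, only an
illustration written against the example above. The names dumps, _children
and _item are hypothetical, and the "- key:{:}" form used for a container
stored under a dictionary key is an assumption that the post does not show.
Comments, binary data and seqdicts are left out, and sets are sorted only to
make the output stable.

```python
# Markers for the three branching types handled by this sketch.
MARKERS = ((list, "[]"), (set, "{}"), (dict, "{:}"))


def _marker(obj):
    """Return the branching marker for a container, or None for a scalar."""
    for typ, mark in MARKERS:
        if isinstance(obj, typ):
            return mark
    return None


def _children(container, indent, step, out):
    """Emit every child of a container at the given indentation."""
    if isinstance(container, dict):
        for key, value in container.items():
            _item(f"- {key}:", value, indent, step, out, keyed=True)
    else:
        # Sets have no order; sort them so the output is reproducible.
        items = sorted(container, key=str) if isinstance(container, set) else container
        for value in items:
            _item("-", value, indent, step, out, keyed=False)


def _item(head, value, indent, step, out, keyed):
    """Emit one item: a nested container, a one-line scalar, or a multiline scalar."""
    pad = " " * indent
    mark = _marker(value)
    if mark is not None:
        out.append(pad + head + mark)          # e.g. "-{:}" (or assumed "- key:[]")
        _children(value, indent + step, step, out)
        return
    lines = str(value).split("\n")
    if len(lines) == 1:
        out.append(f"{pad}{head} {lines[0]}")
    elif keyed:
        out.append(pad + head + "-")           # "- key:-" introduces a multiline value
        out.extend(" " * (indent + step) + ln for ln in lines)
    else:
        out.append(f"{pad}{head} {lines[0]}")
        out.extend(" " * (indent + step) + ln for ln in lines[1:])


def dumps(obj, step=2):
    """Serialize nested lists, sets and dicts of strings into indented text."""
    if step < 2:
        raise ValueError("indentation must be at least two characters")
    mark = _marker(obj)
    if mark is None:
        raise TypeError("top-level value must be a list, set, or dict")
    out = [mark]
    _children(obj, step, step, out)
    return "\n".join(out)


data = [
    "first item",
    "second item\nwith multiple lines",
    {"key1": "value1", "key2": "line one\nline two"},
]
print(dumps(data))
```

No quoting or escaping is needed for the multiline values here: the
indentation alone tells a reader where each value ends.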
The outermost level could be handled by blank lines to make it more
readable. So a bibtex type of file would be like
[]{:}
- bibkey: ...
- author: ...
- title: ...

- bibkey: ...
- author: ...
- title: ...
For deeply nested structures, it is more efficient but less readable to use
0)- a
1)- b
2)- c
3)- d
2)- e
in place of
- a
  - b
    - c
      - d
    - e
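The two forms carry the same information, so converting between them is
mechanical. The following Python sketch shows one way to do it, assuming one
item per line, a fixed indentation step of two spaces, and the "N)" depth
prefix used above. The function names are hypothetical and this is not the
original code.

```python
import re

# A depth-numbered line looks like "2)- c": a depth, ")", then the content.
_NUMBERED = re.compile(r"^(\d+)\)(.*)$")


def numbered_to_indented(text, step=2):
    """Turn lines such as '2)- c' into lines indented by depth * step spaces."""
    out = []
    for line in text.splitlines():
        match = _NUMBERED.match(line)
        if match is None:
            raise ValueError(f"missing depth prefix: {line!r}")
        depth, rest = int(match.group(1)), match.group(2)
        out.append(" " * (depth * step) + rest)
    return "\n".join(out)


def indented_to_numbered(text, step=2):
    """Turn indented lines into the depth-numbered form."""
    out = []
    for line in text.splitlines():
        stripped = line.lstrip(" ")
        width = len(line) - len(stripped)
        if width % step:
            raise ValueError(f"indentation is not a multiple of {step}: {line!r}")
        out.append(f"{width // step}){stripped}")
    return "\n".join(out)


numbered = "0)- a\n1)- b\n2)- c\n3)- d\n2)- e"
print(numbered_to_indented(numbered))
```

The numbered prefix stays a few characters wide however deep the nesting
goes, while the indented form grows by one step per level.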
Assuming that the newline character occurs in binary data with a frequency
of 1/256, and that the structural markers at the beginning of each line
occupy fewer than 10 characters, the bloat factor for binary data would be
less than 5% (about 10/256, or roughly 3.9%).
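That estimate is easy to check. The Python sketch below computes the expected
overhead and compares it with a simulation on pseudo-random bytes. It assumes
uniformly distributed byte values and a fixed 10-character prefix on every
line, which is the upper bound mentioned above. The function names and the
seed are arbitrary choices for illustration.

```python
import random


def expected_bloat(prefix_len=10, newline_freq=1 / 256):
    """Expected overhead: prefix characters added per byte of payload."""
    return prefix_len * newline_freq


def simulated_bloat(n_bytes=100_000, prefix_len=10, seed=0):
    """Measure the overhead on pseudo-random bytes split at newline bytes."""
    rng = random.Random(seed)
    data = bytes(rng.getrandbits(8) for _ in range(n_bytes))
    n_lines = data.count(b"\n") + 1   # every line gets one structural prefix
    return n_lines * prefix_len / n_bytes


print(f"expected:  {expected_bloat():.4f}")
print(f"simulated: {simulated_bloat():.4f}")
```

Real binary data is rarely uniform, so the actual figure depends on how often
the newline byte occurs in the payload.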
OK, hope this makes sense. If this is still interesting I'll dig the thing
out. I have documents and code (perl and python) at home, but I'll have to
dig through several tar files to find them, maybe on a hard disk that's not
mounted. This all started back when I tried to use perl to manage my bibtex
files, before I knew Python, so some of them use % and @ to represent
hash and array, following perl. The format itself also changed somewhat over
the years. So don't expect those to be more organized than this post. :-)
They certainly have more details, though.
Huaiyu