[XML-SIG] Is anyone implementing EXI in Python?

Fri Jul 17 10:06:01 CEST 2009

Hi,

Stanley A. Klein wrote:
> On Wed, 2009-07-15 at 22:26 +0200, Stefan Behnel wrote:
>> A well chosen compression method is a lot better suited to such
>> applications and is already supported by most available XML parsers (or
>> rather outside of the parsers themselves, which is a huge advantage).
> 
> It depends on the nature of the XML application.  One feature of EXI is to
> support representation of numeric data as bits rather than characters. 
> That is very useful in appropriate applications.

One drawback is that this requires a schema to make sure the number of bits
is sufficient. Otherwise, you'd need to add information about how many bits
each value uses for its representation, which would add to the data volume.

> There is a measurements
> document that shows the compression that was achieved on a wide variety of
> test cases.  Straight use of a common compression algorithm does not
> necessarily achieve the best results.

Repetitive data like an XML byte stream compresses extremely well, though,
and the 'best' compression isn't always required anyway. I worked on a
Python SOAP application where we sent some 3MB of XML as a web service
response. That took a couple of seconds to transmit. Injecting the standard
gzip algorithm into the WSGI stack got it down to some 48KB. Nothing more
to do here.
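
To illustrate what "injecting gzip into the WSGI stack" can look like, here
is a minimal sketch of such a middleware. It is not the code from that
application: it assumes Python 3, buffers the whole response in memory, and
the names gzip_middleware and demo_app are made up for this example.

```python
import gzip


def gzip_middleware(app):
    """Wrap a WSGI app and gzip its response body if the client accepts it."""
    def wrapper(environ, start_response):
        # Only compress when the client announced gzip support.
        if "gzip" not in environ.get("HTTP_ACCEPT_ENCODING", ""):
            return app(environ, start_response)

        captured = {}
        chunks = []

        def capture(status, headers, exc_info=None):
            # Remember status and headers instead of sending them right away.
            captured["status"] = status
            captured["headers"] = headers
            return chunks.append  # stands in for the legacy write() callable

        result = app(environ, capture)
        try:
            chunks.extend(result)
        finally:
            if hasattr(result, "close"):
                result.close()

        compressed = gzip.compress(b"".join(chunks))
        # Drop headers that no longer match the compressed body.
        headers = [(k, v) for k, v in captured["headers"]
                   if k.lower() not in ("content-length", "content-encoding")]
        headers.append(("Content-Encoding", "gzip"))
        headers.append(("Content-Length", str(len(compressed))))
        start_response(captured["status"], headers)
        return [compressed]
    return wrapper


# Hypothetical application returning a repetitive XML payload.
DEMO_BODY = (b"<rows>" + b"<row><id>1</id><name>example</name></row>" * 5000
             + b"</rows>")


def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/xml")])
    return [DEMO_BODY]


wrapped_app = gzip_middleware(demo_app)
```

A production setup would more likely use an existing, tested middleware or
let the front-end web server do the compression.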

If you need 'the best' compression, there's no way around benchmarking a
couple of different algorithms that are suitable for your application, and
choosing the one that works best for your data. That may or may not include
EXI.
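
Such a benchmark does not need to be elaborate. The following sketch
compares the output sizes of three compressors from the Python 3 standard
library on one payload. The sample document and the name compare_compressors
are made up for illustration; a real comparison should use your own data and
also measure time, and an EXI implementation would be one more candidate.

```python
import bz2
import lzma
import zlib


def compare_compressors(data):
    """Return {name: compressed size in bytes} for a few stdlib compressors."""
    candidates = {
        "zlib": zlib.compress,
        "bz2": bz2.compress,
        "lzma": lzma.compress,
    }
    return {name: len(compress(data)) for name, compress in candidates.items()}


# Hypothetical sample payload; replace with representative real documents.
sample = (
    b"<?xml version='1.0'?><items>"
    + b"".join(b"<item id='%d'><name>item %d</name></item>" % (i, i)
               for i in range(2000))
    + b"</items>"
)

sizes = compare_compressors(sample)
for name, size in sorted(sizes.items(), key=lambda item: item[1]):
    print(f"{name:5s} {size:8d} bytes ({size / len(sample):.1%} of original)")
```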

> Besides, EXI incorporates elements
> of common compression algorithm(s) as both a fallback for its schema-less
> mode and an additional capability in its schema-informed mode.

Makes sense, as compression also applies to text content, for example.

> EXI is intended for use outboard of the parser, and that would apply
> equally well to a Python version.  For example, EXI gets rid of the need
> to constantly resend over-the-wire all the namespace definitions with each
> message.  The relevant strings would just go into the string table and get
> restored from there when the message is converted back.

That's how any dictionary-based compression algorithm (such as the DEFLATE
scheme behind gzip) works anyway: repeated strings get replaced by short
back-references. Plus, namespace definitions usually only happen once in a
document, so they are pretty much negligible in a larger XML document.

> However, for something like SOAP in certain applications, it may be
> eventually desirable to integrate the EXI implementation within the
> communications system.  The message sender could reasonably create a
> schema-informed EXI version without actually starting from and converting
> an XML object.  The recipient would have to convert the EXI back to XML,
> parse it, and use the data.

Ok, that's where I see it, too. At the level where you'd normally apply a
compression algorithm anyway.

> Numeric data is most efficiently sent as bits

Depends on how you select the bits. When I write into my schema that I use
a 32-bit integer value in my XML, and all I really send happens to be
in the range 0-9 in, say, 95% of the cases with a few exceptions that really
require 32 bits, a general-purpose compression algorithm will easily
beat anything that sends the value as a 4-byte sequence. That's the
advantage of general compression: it sees the real data, not only its schema.
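
A rough way to check this claim on synthetic data is sketched below. The
distribution (95% of the values in 0-9, the rest spread over the 32-bit
range), the <v> element name and the sample size are assumptions made for
this example, and the fixed-width side is a plain 4-byte packing, not EXI
itself.

```python
import random
import struct
import zlib

random.seed(0)  # fixed seed so that repeated runs see the same data

# 95% small values, 5% values that need most of the 32-bit range.
values = [
    random.randint(0, 9) if random.random() < 0.95
    else random.randint(0, 2**31 - 1)
    for _ in range(10000)
]

# Schema-style encoding: every value as a 4-byte big-endian signed integer.
fixed_width = b"".join(struct.pack(">i", v) for v in values)

# Plain XML text run through a general-purpose compressor (zlib/DEFLATE).
xml_text = (
    "<values>" + "".join(f"<v>{v}</v>" for v in values) + "</values>"
).encode("ascii")
compressed_xml = zlib.compress(xml_text)

print("fixed 4-byte encoding:", len(fixed_width), "bytes")
print("uncompressed XML text:", len(xml_text), "bytes")
print("zlib-compressed XML:  ", len(compressed_xml), "bytes")
```

The fixed-width encoding always costs 4 bytes per value, whereas the
compressor only pays for the variation that actually occurs in the data.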

I do not question EXI in general, I'm fine with it having its niche
(wherever that turns out to be). I'm just saying that common compression
algorithms are a lot more broadly available and achieve similar results. So
EXI is just another way of compressing XML, with the disadvantage of not
being as widely implemented. Compare it to the ubiquity of the gzip
compression algorithm, for example. It's just the usual trade-off that you
make between efficiency and cross-platform compatibility.

Stefan