[Python-bugs-list] [ python-Bugs-467384 ] provide a documented serialization func

noreply@sourceforge.net noreply@sourceforge.net
Fri, 12 Oct 2001 20:08:34 -0700


Bugs item #467384, was opened at 2001-10-02 19:25
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=467384&group_id=5470

Category: None
Group: Feature Request
Status: Open
Resolution: None
Priority: 5
Submitted By: paul rubin (phr)
Assigned to: Nobody/Anonymous (nobody)
Summary: provide a documented serialization func

Initial Comment:
It would be nice if there was a documented library
function for serializing Python basic objects
(numbers, strings, dictionaries, and lists).
By documented I mean the protocol is specified in
the documentation, precisely enough to write
interoperating implementations in other languages.

Code-wise, the marshal.dumps and loads functions do
what I want, but their data format is (according to the
documentation) intentionally not specified, because
the format might change in future Python versions.
Maybe that doc was written long enough ago that it's
ok to freeze the marshal format now, and document it?
I just mean for the basic types listed above.  Stuff
like code objects don't have to be specified.  In
fact it would be nice if there was a flag to the
loads and dumps functions to refuse to marshal/
unmarshal those objects.
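
A minimal sketch of the requested behaviour, written for present-day Python 3 rather than the interpreter of the time. The names BASIC_TYPES, check_basic, safe_dumps and safe_loads are hypothetical; the marshal module has no such flag, so the check is done in a wrapper around marshal.dumps and marshal.loads.

```python
import marshal

# Hypothetical whitelist: the "basic" types named in the request.
BASIC_TYPES = (int, float, str, bytes, list, dict)

def check_basic(obj):
    """Raise TypeError unless obj is built only from exact BASIC_TYPES."""
    if type(obj) not in BASIC_TYPES:
        raise TypeError("refusing to marshal %s" % type(obj).__name__)
    if isinstance(obj, list):
        for item in obj:
            check_basic(item)
    elif isinstance(obj, dict):
        for key, value in obj.items():
            check_basic(key)
            check_basic(value)

def safe_dumps(obj):
    """marshal.dumps, but only for the whitelisted types."""
    check_basic(obj)
    return marshal.dumps(obj)

def safe_loads(data):
    """marshal.loads, then reject anything outside the whitelist."""
    obj = marshal.loads(data)
    check_basic(obj)
    return obj

record = {"n": 12345678, "big": 2 ** 200, "items": [1, 2.5, "x"]}
assert safe_loads(safe_dumps(record)) == record
```

Note that safe_loads only inspects the result after marshal.loads has run, so it does not make loading untrusted data safe; a real flag would have to live inside the unmarshalling code itself.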

Pickle/cpickle aren't really appropriate for what I'm
asking, since they're complicated (they try to handle
class instances, circular structure, etc.) and anyway
they're not documented either.

The XDR library is sort of ok, but it's written in
Python (i.e. slow) and it doesn't automatically
handle compound objects.

Thanks



----------------------------------------------------------------------

>Comment By: paul rubin (phr)
Date: 2001-10-12 20:08

Message:
Logged In: YES 
user_id=72053

Skip - C has struct objects which are sort of like Python
dictionaries.  XMLRPC represents structs as name-value
pairs, for example.  And "other languages" doesn't
necessarily mean C.  The marshaller should be able to
represent the non-Python-specific serializable objects,
not just scalars.
Basically this means strings, integers (of any length),
dictionaries, lists, and floats (hmm--unicode?), but
not necessarily stuff like code objects.

Having an independent marshal library is ok, I guess,
though I don't feel it's necessary to create more
implementation work.  And one benefit of using
the existing marshaller is that it's already available in
most versions of Python that people are running
(Red Hat 7.1 still comes with Python 1.5 for example).

Tim - yes, I originally used a binascii.hexlify hack
similar to yours and it worked ok, but it was ugly.  I
also had to handle strings (generate a length count
followed by the contents) and then dictionaries (name-value
pairs) and finally felt I shouldn't need to rewrite the
marshaller like that.  There's already a built-in library
function that does everything I need, very efficiently in
native code, in one call, and being able to use it is in
the "batteries included" spirit.  

Also, the current long int marshalling format
is just a digit count (16-bit digits) followed by the digits
in binary.  If the digit width changes, the marshalling
format doesn't have to change--the marshalling code should
still be able to use the same external representation 
without excessive contortions and without slowing down.
(You'll see that it's already not a simple memory dump,
but a structure read and written one byte at a time through
layers of subroutines).  Changing widths while keeping the
old format means putting a minor kludge in the marshalling
code, but no user will ever notice it.
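
As a rough illustration of an external long format that does not depend on the interpreter's internal digit width, here is a sketch in present-day Python 3. It assumes a layout of a 32-bit digit count followed by 16-bit little-endian words carrying 15 value bits each; that layout, and the names DIGIT_BITS, long_to_digits and digits_to_long, are assumptions made for the example, not a specification of what marshal writes.

```python
import struct

DIGIT_BITS = 15  # assumption: 15 value bits stored in each 16-bit word
DIGIT_MASK = (1 << DIGIT_BITS) - 1

def long_to_digits(n):
    """Encode a non-negative int as a digit count followed by the digits."""
    if n < 0:
        raise ValueError("sketch handles non-negative values only")
    digits = []
    while n:
        digits.append(n & DIGIT_MASK)  # least significant digit first
        n >>= DIGIT_BITS
    return struct.pack("<i%dH" % len(digits), len(digits), *digits)

def digits_to_long(data):
    """Decode the layout written by long_to_digits."""
    (count,) = struct.unpack_from("<i", data, 0)
    digits = struct.unpack_from("<%dH" % count, data, 4)
    n = 0
    for digit in reversed(digits):
        n = (n << DIGIT_BITS) | digit
    return n

value = 2 ** 300 + 12345
assert digits_to_long(long_to_digits(value)) == value
```

The encoder never looks at how the interpreter stores the integer; it only uses shifts and masks, so the external form stays the same whatever the internal digit size is.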

As for the speed of Python longs,
my stuff's runtime is dominated by modular exponentiations
<wink> and I'm already using gmpy for those when it's 
available (but I don't depend on it).  The speedup with
gmpy is substantial, but the speed with ordinary Python
longs is quite acceptable on my PIII (the StrongARM is
another story--probably the C compiler's fault).

Examining Python/marshal.c, I don't see any objects of
the types I've mentioned that are likely to need to change
representations--do you?  

Btw I notice that the pickle module represents long ints
as decimal strings even in "binary" mode, but I'll resist
opening another bug for that, for now.

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2001-10-12 14:41

Message:
Logged In: YES 
user_id=44345

If you head in the direction of documenting marshal with the aim of potentially interoperating with other languages, I think it would be a good idea to create a Python-independent marshal library. This would facilitate incorporation into other languages.  Such a library probably wouldn't be able to do everything marshal can (there isn't an obvious C equivalent of Python's dictionary object, for example), but would still help nail down compatibility issues for the basic scalar types.



----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2001-10-12 14:09

Message:
Logged In: YES 
user_id=31435

I'm not sure this is making progress.  Paul, if you want to 
use marshal, you already can:  the pack and unpack routines 
are exposed in Python via the marshal module.  Freezing the 
representation isn't a particularly appealing idea; e.g., 
if anyone is likely to complain about the speed of Python's 
longs, it's you <wink>, and the current marshal format for 
longs is just a raw dump of Python's internal long 
representation -- but the most obvious everything-benefits 
way to speed Python longs is to increase the size of 
the "digits" used in its internal representation.  If 
that's ever done, the marshal format would want to change 
too.

It's easy enough to code your own storage format for longs, 
e.g.

>>> def tobin(i):
...     import binascii
...     ashex = hex(long(i))[2:-1] # chop '0x' and trailing 'L'
...     if len(ashex) & 1:
...         ashex = '0' + ashex
...     return binascii.unhexlify(ashex)

implements "base 256" for unsigned longs, and the runtime 
cannot be improved by rewriting in C except by a constant 
factor (the Python spelling has the right O() behavior).
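
For readers on present-day Python 3, where long() and the trailing 'L' no longer exist, roughly the same "base 256" encoding can be sketched with int.to_bytes. The inverse frombin is an addition for illustration and is not part of the original message.

```python
def tobin(i):
    """Big-endian base-256 encoding of a non-negative integer."""
    if i < 0:
        raise ValueError("unsigned values only")
    # At least one byte, so that 0 encodes as a single zero byte.
    length = max(1, (i.bit_length() + 7) // 8)
    return i.to_bytes(length, "big")

def frombin(data):
    """Inverse of tobin (hypothetical helper added for this sketch)."""
    return int.from_bytes(data, "big")

assert tobin(256) == b"\x01\x00"
assert frombin(tobin(10 ** 300)) == 10 ** 300
```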


----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2001-10-12 13:24

Message:
Logged In: YES 
user_id=6380

This helps tremendously.

I think that marshal is probably overkill. Rather, you need
helper routines to convert longs to and from binary. You can
do everything else using the struct module, and it's
probably easier to write your own protocol using that and
these helpers. I suggest that the best place to add these
helpers is the binascii module, which already has a bunch of
similar things (e.g. hexlify and crc32).
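
A sketch of what such a home-grown protocol might look like in present-day Python 3, using struct for the framing and int.to_bytes for the long-to-binary step. The one-byte tags, the 4-byte big-endian length fields, and the names encode and decode are all invented for this example; nothing here is an existing or proposed standard format.

```python
import struct

def encode(obj):
    """Serialize non-negative ints, bytes, lists and dicts as tag + count + body."""
    if isinstance(obj, int):
        if obj < 0:
            raise ValueError("sketch handles non-negative ints only")
        body = obj.to_bytes(max(1, (obj.bit_length() + 7) // 8), "big")
        return b"L" + struct.pack(">I", len(body)) + body
    if isinstance(obj, bytes):
        return b"S" + struct.pack(">I", len(obj)) + obj
    if isinstance(obj, list):
        body = b"".join(encode(item) for item in obj)
        return b"[" + struct.pack(">I", len(obj)) + body
    if isinstance(obj, dict):
        body = b"".join(encode(k) + encode(v) for k, v in obj.items())
        return b"{" + struct.pack(">I", len(obj)) + body
    raise TypeError("unsupported type %s" % type(obj).__name__)

def decode(data, pos=0):
    """Return (object, position just past it)."""
    tag = data[pos:pos + 1]
    (count,) = struct.unpack_from(">I", data, pos + 1)
    pos += 5
    if tag == b"L":
        return int.from_bytes(data[pos:pos + count], "big"), pos + count
    if tag == b"S":
        return data[pos:pos + count], pos + count
    if tag == b"[":
        items = []
        for _ in range(count):
            item, pos = decode(data, pos)
            items.append(item)
        return items, pos
    if tag == b"{":
        result = {}
        for _ in range(count):
            key, pos = decode(data, pos)
            value, pos = decode(data, pos)
            result[key] = value
        return result, pos
    raise ValueError("unknown tag %r" % tag)

message = {b"name": b"phr", b"key": [2 ** 521 - 1, 65537]}
decoded, end = decode(encode(message))
assert decoded == message
```

For ints and strings the count is a byte length; for lists and dicts it is an element count, which keeps the decoder to a single pass with no lookahead.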

Note that xmlrpc is bundled with Python 2.2.

Looking forward to your patch (much simpler to get accepted
than a PEP :-).

----------------------------------------------------------------------

Comment By: paul rubin (phr)
Date: 2001-10-12 13:16

Message:
Logged In: YES 
user_id=72053

Decimal is bad not just because of the data expansion but
because the arithmetic to convert a decimal string to binary
can be expensive (all that multiplication).  I'd rather use
hex than decimal for that reason.  One envisioned
application is communicating with a cryptography coprocessor: an
8-bit microcontroller (with a public key accelerator)
connected to the host computer through a slow serial port.
Most of the ints involved would be around 300 decimal
digits.
A simple binary format is a lot easier to deal with
in that environment than something like xmlrpc.  Also,
the format would be used for data persistence, so again,
unnecessary data expansion isn't desirable.
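
To put rough numbers on the data expansion, the following present-day Python 3 sketch compares three encodings of one 300-digit integer. The value is an arbitrary example, not data from the application described above.

```python
n = 10 ** 299 + 7  # an arbitrary integer with 300 decimal digits

as_decimal = str(n).encode("ascii")        # one byte per decimal digit
as_hex = format(n, "x").encode("ascii")    # one byte per 4 bits
as_binary = n.to_bytes((n.bit_length() + 7) // 8, "big")  # one byte per 8 bits

print(len(as_decimal), len(as_hex), len(as_binary))
```

Hex and binary can also be decoded with shifts alone, whereas decoding decimal needs a multiplication per digit, which is the cost that matters on a small microcontroller.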

I looked at XMLRPC and it's not designed for this purpose.
It's intended as an RPC protocol over HTTP and isn't
well suited for object persistence.  Also, it doesn't
support integers over 32 bits, and binary strings must be
base64 encoded (more bloat).  Finally, it's not included
with Python, so I'd have to bundle an implementation
written in Python (i.e. slow) with my application (I don't
know whether Fred's implementation is Python or C).  I
think the marshal format hasn't changed since before
Python 1.5, so basing serialization on marshal would mean
applications could interoperate with older versions of
Python as well as newer ones, which helps Python's maturity.
(Maturity of a program means, among other things, that
users rarely need to be told they need the latest version
in order to use some feature).

Really, the marshal functions are written the way they're
written because that's the simplest and most natural way
of doing this kind of thing.  So the proposal is mainly
to make them available for user applications, rather than
only for system internals.

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2001-10-12 12:37

Message:
Logged In: YES 
user_id=6380

If the PEP makes a reasonable case for freezing the spec,
yes.

I wonder why you can't use decimal? Are you talking really
large volumes? The PEP needs to motivate this with an
example, preferably plucked from real life!

----------------------------------------------------------------------

Comment By: paul rubin (phr)
Date: 2001-10-12 12:29

Message:
Logged In: YES 
user_id=72053

I just want to be able to do convenient transfers of
python data to other programs including over the network.
XMLRPC is excessive bloat in my opinion.  Sending a number
like 12345678 should take at most 5 bytes (a type byte and
a 4-byte int) instead of <int>12345678</int>.  For long
ints (300 digits) it's even worse.
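
The size claim can be checked with a short present-day Python 3 sketch. The layout used here (a one-byte type tag followed by a little-endian 4-byte int) is an assumption chosen to match the description, not a documented format.

```python
import struct

value = 12345678

# Assumed layout: 1 type byte + 4-byte little-endian signed int.
binary = struct.pack("<ci", b"i", value)

# The XML-RPC style element for the same value.
xml = ("<int>%d</int>" % value).encode("ascii")

print(len(binary), len(xml))
```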

The marshal format is fine, and writing a PEP would solve
the doc problem, but the current marshal doc says the
non-specification is intentional.  Writing it in a PEP
means not just documenting--it means asking the language
maintainers to freeze the marshal format of certain types,
instead of reserving the right to change the format in
future versions.  Writing the PEP only makes sense if
you're willing to freeze the format for those types (the
other types can stay undocumented).  Is that ok with you?

Thanks
Paul


----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2001-10-12 07:33

Message:
Logged In: YES 
user_id=6380

Paul, I don't understand the application that you are
envisioning. If you think that the marshal format is what
you want, why don't you write a PEP that specifies the
format? That would solve the documentation problem.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2001-10-12 02:39

Message:
Logged In: YES 
user_id=21627

Well, then I guess you need to specify your requirements
more clearly. XML-RPC was precisely developed to be
something simple for primitive types and structures that is
sufficiently  well-specified to allow interoperation between
various languages.

I don't see why extending the data 'by an order of
magnitude' would be a problem per se, nor do I see why
'requiring a complicated parser' is a problem if the
implementation already does all the unpacking for you under
the hood.

Furthermore, I believe it is simply not true that XML-RPC
expands the representation by an order of magnitude. For
example, the Python Integer object 1 takes 12 bytes in its
internal representation (plus the overhead that malloc
requires); the XML-RPC representation '<int>1</int>' also
uses 12 bytes.
In short, you need to say as precisely as possible what it is
that you want, or you won't get it. Also, it may be that you
have conflicting requirements (e.g. 'compact, binary', and
'simple, easily processible in different languages'); then
you won't get it either. For a marshalling format that is
accessible from different languages, you had better specify it
first and then implement it.

----------------------------------------------------------------------

Comment By: paul rubin (phr)
Date: 2001-10-11 22:12

Message:
Logged In: YES 
user_id=72053

I haven't looked at xmlrpclib, but I'm looking for
a simple, compact, binary representation, not something
that needs a complicated parser and expands the data by
an order of magnitude.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2001-10-05 17:10

Message:
Logged In: YES 
user_id=21627

So what's wrong with xmlrpclib?


----------------------------------------------------------------------
