[Python-ideas] Ideas for improving the struct module

Nick Timkovich prometheus235 at gmail.com
Thu Jan 19 14:20:08 EST 2017


Construct has a radically different API and should remain a separate
package. Introducing such a large library into the discussion, as
justification for calling this proposal too specialized, feels like a
straw man to me.

This proposal to me seems much more modest: add another format character
(or two) to the existing set of a dozen or so that will be packed/unpacked
just like the others. It also has demonstrable use in various
formats/protocols.
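
To make the scope concrete, here is a minimal sketch of the behaviour such
a format character would provide, written as a plain helper on top of the
existing module. The names unpack_prefixed and iter_prefixed are
hypothetical (they are not part of struct or netstruct); the sketch only
shows the read-a-length-then-read-that-many-bytes step folded into one
call, using the byte strings quoted further down in this thread.

```python
import struct


def unpack_prefixed(prefix_fmt, data, offset=0):
    """Read one length-prefixed byte string from data.

    Hypothetical helper, not part of the struct module. prefix_fmt is an
    ordinary struct format for the length field (e.g. '!I' or '!H').
    Returns (payload, offset_just_past_payload).
    """
    # Decode the fixed-width length prefix at the current offset.
    (length,) = struct.unpack_from(prefix_fmt, data, offset)
    start = offset + struct.calcsize(prefix_fmt)
    end = start + length
    # Slicing would silently truncate, so check the bounds explicitly.
    if end > len(data):
        raise struct.error("length prefix runs past the end of the buffer")
    return data[start:end], end


def iter_prefixed(prefix_fmt, data):
    """Yield each length-prefixed byte string packed back to back in data."""
    offset = 0
    while offset < len(data):
        payload, offset = unpack_prefixed(prefix_fmt, data, offset)
        yield payload


if __name__ == "__main__":
    print(unpack_prefixed('!I', b'\x00\x00\x00\x0chello world!'))
    print(list(iter_prefixed('!H', b'\x00\x05hello\x00\x07goodbye\x00\x04test')))
```

The offset bookkeeping lives in one place here, which is roughly what a
built-in format character would do for the caller.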

On Thu, Jan 19, 2017 at 12:50 PM, Nathaniel Smith <njs at pobox.com> wrote:

> I haven't had a chance to use it myself yet, but I've heard good things
> about
>
> https://construct.readthedocs.io/en/latest/
>
> It's certainly far more comprehensive than struct for this and other
> problems.
>
> As usual, there's some tension between adding stuff to the stdlib versus
> using more specialized third-party packages. The existence of packages like
> construct doesn't automatically mean that we should stop improving the
> stdlib, but OTOH not every useful thing can or should be in the stdlib.
>
> Personally, I find myself parsing uleb128-prefixed strings more often than
> u4-prefixed strings.
>
> On Jan 19, 2017 10:42 AM, "Nick Timkovich" <prometheus235 at gmail.com>
> wrote:
>
>> ctypes.Structure is *literally* the interface to the C struct that, as
>> Chris mentions, has fixed offsets for all members. I don't think that should
>> (can?) be altered.
>>
>> In file formats (beyond net protocols) the string size + variable length
>> string motif comes up often and I am frequently re-implementing the
>> two-line read-an-int + read-{}.format-bytes.
>>
>> On Thu, Jan 19, 2017 at 12:17 PM, Joao S. O. Bueno
>> <jsbueno at python.org.br> wrote:
>>
>>> I am for upgrading struct to these, if possible.
>>>
>>> But besides my +1, I am writing in to remind folks that there is another
>>> "struct" model in the stdlib:
>>>
>>> ctypes.Structure  -
>>>
>>> For reading a lot of records with the same structure it is much more
>>> handy than struct, since it gives one a suitable Python object on
>>> instantiation.
>>>
>>> However, it also can't handle variable-length fields automatically.
>>>
>>> But maybe the improvement could be made on that side, or in another
>>> package altogether that works more like it than the current "struct".
>>>
>>>
>>>
>>> On 19 January 2017 at 16:08, Elizabeth Myers <elizabeth at interlinked.me>
>>> wrote:
>>> > On 19/01/17 06:47, Elizabeth Myers wrote:
>>> >> On 19/01/17 05:58, Rhodri James wrote:
>>> >>> On 19/01/17 08:31, Mark Dickinson wrote:
>>> >>>> On Thu, Jan 19, 2017 at 1:27 AM, Steven D'Aprano
>>> >>>> <steve at pearwood.info> wrote:
>>> >>>>> [...] struct already supports
>>> >>>>> variable-width formats.
>>> >>>>
>>> >>>> Unfortunately, that's not really true: the Pascal strings it supports
>>> >>>> are in some sense variable length, but are stored in a fixed-width
>>> >>>> field. The internals of the struct module rely on each field starting
>>> >>>> at a fixed offset, computable directly from the format string. I don't
>>> >>>> think variable-length fields would be a good fit for the current
>>> >>>> design of the struct module.
>>> >>>>
>>> >>>> For the OP's use-case, I'd suggest a library that sits on top of the
>>> >>>> struct module, rather than an expansion to the struct module itself.
>>> >>>
>>> >>> Unfortunately as the OP explained, this makes the struct module a poor
>>> >>> fit for protocol decoding, even as a base layer for something.  It's one
>>> >>> of the things I use python for quite frequently, and I always end up
>>> >>> rolling my own and discarding struct entirely.
>>> >>>
>>> >>
>>> >> Yes, for variable-length fields the struct module is worse than useless:
>>> >> it actually reduces clarity a little. Consider:
>>> >>
>>> >> >>> test_bytes = b'\x00\x00\x00\x0chello world!'
>>> >>
>>> >> With this, you can do:
>>> >>
>>> >> >>> length = int.from_bytes(test_bytes[:4], 'big')
>>> >> >>> string = test_bytes[4:4 + length]
>>> >>
>>> >> or you can do:
>>> >>
>>> >> >>> length = struct.unpack_from('!I', test_bytes)[0]
>>> >> >>> string = struct.unpack_from('{}s'.format(length), test_bytes, 4)[0]
>>> >>
>>> >> Which looks more readable without consulting the docs? ;)
>>> >>
>>> >> Building anything on top of the struct library like this would lead to
>>> >> worse-looking code for minimal gains in efficiency. To quote Jamie
>>> >> Zawinski, it is like building a bookshelf out of mashed potatoes as it
>>> >> stands.
>>> >>
>>> >> If we had an extension similar to netstruct:
>>> >>
>>> >> >>> length, string = struct.unpack('!I$', test_bytes)
>>> >>
>>> >> MUCH improved readability, and also less verbose. :)
>>> >
>>> > I also didn't mention that when you are unpacking iteratively (e.g., you
>>> > have multiple strings), the code becomes a bit more hairy:
>>> >
>>> > >>> test_bytes = b'\x00\x05hello\x00\x07goodbye\x00\x04test'
>>> > >>> offset = 0
>>> > >>> while offset < len(test_bytes):
>>> > ...     length = struct.unpack_from('!H', test_bytes, offset)[0]
>>> > ...     offset += 2
>>> > ...     string = struct.unpack_from('{}s'.format(length), test_bytes, offset)[0]
>>> > ...     offset += length
>>> >
>>> > It actually gets a lot worse when you have to unpack a set of strings in
>>> > a context-sensitive manner. You have to be sure to update the offset
>>> > constantly so you can always unpack strings appropriately. Yuck!
>>> >
>>> > It's worth mentioning that a few years ago, a coworker and I found
>>> > ourselves needing variable length strings in the context of a binary
>>> > protocol (DHCP), and wound up abandoning the struct module entirely
>>> > because it was unsuitable. My co-worker said the same thing I did: "it's
>>> > like building a bookshelf out of mashed potatoes."
>>> >
>>> > I do understand it might require a possible major rewrite or major
>>> > changes to the struct module, but in the long run, I think it's worth it
>>> > (especially because the struct module is not all that big in scope). As
>>> > it stands, the struct module simply is not suited for protocols where
>>> > you have variable-length strings, and in my experience, that is the vast
>>> > majority of modern binary protocols on the Internet.
>>> >
>>> > --
>>> > Elizabeth
>>> > _______________________________________________
>>> > Python-ideas mailing list
>>> > Python-ideas at python.org
>>> > https://mail.python.org/mailman/listinfo/python-ideas
>>> > Code of Conduct: http://python.org/psf/codeofconduct/