[Python-Dev] Lazy unpacking for struct module

Terry Reedy tjreedy at udel.edu
Sun Jun 12 22:31:47 CEST 2011


On 6/12/2011 11:29 AM, Lukas Lueg wrote:

This sort of speculative idea might fit the python-ideas list better.

[Summary: we often need to extract a field or two from a binary record 
in order to decide whether to toss it or unpack it all and process.]

> One solution to this is using two format-strings instead of only one
> (e.g. '4s4s i 4s2s2s'): One that unpacks just the filtered fields
> (e.g. '8x i 8x') and one that unpacks all the fields except the one
> already created by the filter (e.g. '4s4s  4x  4s2s2s'). This solution
> works very well and increases throughput by far. It however also
> creates complexity in the code as we have to keep track and combine
> field-values that came from the filtering-part with the ones unpacked
> during inspection-part (we don't want to simply unpack twice).

With just 1 or 2 filter fields, and very many other fields, I would just 
unpack everything, including the filter field. I expect the extra time 
to do that would be comparable to the extra time to combine. It 
certainly would make your code easier. I suspect you could write a 
function that, given the full format and the field numbers to keep, 
creates the filter-fields-only format for you.
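
A minimal sketch of such a helper, in Python 3. The name filter_format 
and its signature are hypothetical, not part of the struct module. It 
assumes the format uses only the documented struct codes, and it 
replaces every field that is not kept with pad bytes ('x') so that the 
kept fields stay at their original offsets.

```python
import re
import struct

# One format token: optional count followed by a format character.
_TOKEN = re.compile(r"(\d*)([a-zA-Z?])")


def filter_format(full_format, keep):
    """Return a format that unpacks only the fields whose indexes are in keep.

    Hypothetical helper, not a struct API. Skipped fields become 'x' padding.
    """
    prefix = full_format[0] if full_format.startswith(tuple("@=<>!")) else ""
    keep = set(keep)
    parts = []    # pieces of the new format
    seen = ""     # format text consumed so far, used to compute offsets
    covered = 0   # number of bytes already described by parts
    index = 0     # index of the current field in the unpacked tuple
    for count, code in _TOKEN.findall(full_format[len(prefix):]):
        if code in "sp":
            items = [count + code]            # one field, count is a length
        elif code == "x":
            items = []                        # padding yields no field
        else:
            items = [code] * int(count or 1)  # count is a repeat
        if not items:
            seen += count + code
        for item in items:
            seen += item
            end = struct.calcsize(prefix + seen)
            if index in keep:
                start = end - struct.calcsize(prefix + item)
                if start > covered:
                    parts.append("%dx" % (start - covered))
                parts.append(item)
                covered = end
            index += 1
    total = struct.calcsize(full_format)
    if total > covered:
        parts.append("%dx" % (total - covered))
    return prefix + "".join(parts)


full = "<4s4si4s2s2s"
record = struct.pack(full, b"aaaa", b"bbbb", 7, b"cccc", b"dd", b"ee")
print(filter_format(full, [2]))                         # <8xi8x
print(struct.unpack(filter_format(full, [2]), record))  # (7,)
```

The filter format is built once per record layout, so its cost does not 
show up in the per-record loop.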

> I'd like to propose an enhancement to the struct module that should
> solve this dilemma and ask for your comments.
>
> The function s_unpack_internal() inside _struct.c currently unpacks
> all values from the buffer-object passed to it and returns a tuple
> holding these values. Instead, the function could create a tuple-like
> object that holds a reference to its own Struct-object (which holds
> the format) and a copy of the memory it is supposed to unpack. This
> object allows access to the unpacked values through the sequence
> protocol, basically unpacking the fields if - and only if - accessed
> through sq_item (e.g. foo = struct.unpack('2s2s', 'abcd'); foo[0] ==
> 'ab'). The object can also unpack all fields only once (as all
> unpacked objects are immutable, we can hold references to them and
> return these instead once known). This approach is possible because
> there are no further error conditions inside the unpacking-functions
> that we would *have* to deal with at the time .unpack() is called; in
> other words: Unpacking can't fail if the format-string's syntax had
> been correct and can therefore be deferred (while packing can't).
>
> I understand that this may seem like a single-case-optimization.

Yep.

> We
> can however assume that most people will benefit from the new behavior
> unknowingly while everyone else takes no harm:

I will not assume that without code and timings. I would expect that 
unpacking one field at a time would take longer than all at once. To me, 
this is the sort of thing that should be written, listed on PyPI, and 
tested by multiple users on multiple systems first.
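
A pure-Python prototype is one way to get such code and timings before 
touching _struct.c. The sketch below is only that: the names 
field_layout, LazyStruct and LazyRecord are hypothetical, it supports 
integer indexes only, and it says nothing about how a C implementation 
would perform. It copies the buffer, unpacks a field on first access, 
and caches the result, as the proposal describes.

```python
import re
import struct

_TOKEN = re.compile(r"(\d*)([a-zA-Z?])")


def field_layout(fmt):
    """Return a list of (offset, struct.Struct) pairs, one per field."""
    prefix = fmt[0] if fmt.startswith(tuple("@=<>!")) else ""
    seen, layout = "", []
    for count, code in _TOKEN.findall(fmt[len(prefix):]):
        if code in "sp":
            items = [count + code]            # one field, count is a length
        elif code == "x":
            items = []                        # padding yields no field
        else:
            items = [code] * int(count or 1)  # count is a repeat
        if not items:
            seen += count + code
        for item in items:
            seen += item
            one = struct.Struct(prefix + item)
            offset = struct.calcsize(prefix + seen) - one.size
            layout.append((offset, one))
    return layout


class LazyRecord:
    """Tuple-like view that unpacks a field only when it is accessed."""

    def __init__(self, layout, data):
        self._layout = layout
        self._data = bytes(data)  # private copy of the memory
        self._cache = {}          # index -> already unpacked value

    def __len__(self):
        return len(self._layout)

    def __getitem__(self, index):
        # Normalizes negative indexes; raises IndexError when out of range.
        index = range(len(self._layout))[index]
        if index not in self._cache:
            offset, one = self._layout[index]
            self._cache[index] = one.unpack_from(self._data, offset)[0]
        return self._cache[index]


class LazyStruct:
    """Hypothetical stand-in for struct.Struct with a lazy unpack()."""

    def __init__(self, fmt):
        self.size = struct.calcsize(fmt)
        self.layout = field_layout(fmt)

    def unpack(self, data):
        # The buffer size is the one check that cannot be deferred.
        if len(data) != self.size:
            raise struct.error("unpack requires a buffer of %d bytes" % self.size)
        return LazyRecord(self.layout, data)


fmt = "<4s4si4s2s2s"
record = struct.pack(fmt, b"aaaa", b"bbbb", 7, b"cccc", b"dd", b"ee")
lazy = LazyStruct(fmt).unpack(record)
print(lazy[2])                                     # 7
print(tuple(lazy) == struct.unpack(fmt, record))   # True
```

Comparing lazy[2] against struct.unpack(fmt, record)[2] with timeit, on 
records with many fields, would show whether deferring the work pays 
off for a given workload.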

-- 
Terry Jan Reedy


