[Python-Dev] Lazy unpacking for struct module

Lukas Lueg lukas.lueg at googlemail.com
Sun Jun 12 17:29:13 CEST 2011


Hi.

We extensively use the struct module to crunch large amounts of binary
data. For us there are basically two operations that only appear to
the naked eye to be one: filtering (check whether certain fields have
certain values, throw the record away if not) and inspection (look at
all of the fields' values). The filtering part is very important, as
most of the binary data can actually be thrown away and never has to
be inspected any further. Thinking about how to increase performance,
one observation was that the struct module generates a lot of objects
we never need: unpacking six fields in order to look at one and then
throwing everything away wastes the work spent on the other five. It
also makes filtering and inspection cost essentially the same with
respect to the (slow) unpacking of values, so we don't really benefit
from filtering. This is a huge problem when crunching gigabytes of
data and creating millions of fields.
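
For concreteness, here is a minimal sketch of the straightforward
approach, assuming the six-field '4s4s i 4s2s2s' layout from the
example below; the record source and the filter value are invented
for illustration. All six objects get created even for records that
are discarded immediately:

    import struct

    RECORD = struct.Struct('4s4s i 4s2s2s')  # six fields per record
    WANTED = 42                              # hypothetical filter value

    def survivors(records):
        for record in records:               # each record is RECORD.size bytes
            a, b, key, c, d, e = RECORD.unpack(record)  # six objects created...
            if key != WANTED:
                continue                     # ...five of them for nothing
            yield a, b, key, c, d, e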

One solution is to use two format strings instead of only one (e.g.
'4s4s i 4s2s2s'): one that unpacks just the filtered field (e.g.
'8x i 8x') and one that unpacks all the fields except the one already
created by the filter (e.g. '4s4s 4x 4s2s2s'). This works very well
and increases throughput considerably. It does, however, complicate
the code, as we have to keep track of and combine field values that
came from the filtering pass with the ones unpacked during the
inspection pass (we don't want to simply unpack everything twice).
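
As a rough sketch (again with invented names and filter value), the
same loop rewritten with the two-format workaround looks like this:

    import struct

    FILTER = struct.Struct('8x i 8x')          # unpacks only the filtered field
    INSPECT = struct.Struct('4s4s 4x 4s2s2s')  # unpacks everything else
    WANTED = 42                                # hypothetical filter value

    def survivors(records):
        for record in records:
            (key,) = FILTER.unpack(record)
            if key != WANTED:
                continue                       # most records die here, cheaply
            a, b, c, d, e = INSPECT.unpack(record)
            # the awkward part: recombine the filtered value with the rest
            yield a, b, key, c, d, e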

I'd like to propose an enhancement to the struct module that should
solve this dilemma and ask for your comments.

The function s_unpack_internal() inside _struct.c currently unpacks
all values from the buffer object passed to it and returns a tuple
holding these values. Instead, the function could create a tuple-like
object that holds a reference to its own Struct object (which holds
the format) and a copy of the memory it is supposed to unpack. This
object allows access to the unpacked values through the sequence
protocol, unpacking a field if - and only if - it is accessed through
sq_item (e.g. foo = struct.unpack('2s2s', b'abcd'); foo[0] == b'ab').
The object also needs to unpack each field only once: as all unpacked
values are immutable, we can hold references to them and return those
on later accesses. This approach is possible because there are no
further error conditions inside the unpacking functions that we would
*have* to deal with at the time .unpack() is called; in other words,
unpacking can't fail if the format string's syntax was correct and can
therefore be deferred (while packing can't).
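
To make the intended semantics concrete, here is a pure-Python sketch
of such a result object. The real thing would of course live in
_struct.c; this model handles only a simplified subset of format
strings (an optional byte-order prefix out of '=<>!', single-field
codes such as '4s' or 'i', repeat counts only on 's'/'x') and is
meant to illustrate the behavior, not the implementation:

    import re
    import struct

    class LazyUnpackResult:
        # One regex match per field; this sketch supports repeat
        # counts only on 's' and 'x'.
        _FIELD = re.compile(r'\d*[xcbB?hHiIlLqQfds]')

        def __init__(self, fmt, buffer):
            prefix = fmt[0] if fmt and fmt[0] in '=<>!' else '='
            self._fields = []                 # one (Struct, offset) per value
            offset = 0
            for code in self._FIELD.findall(fmt):
                s = struct.Struct(prefix + code)
                if 'x' not in code:           # pad bytes yield no value
                    self._fields.append((s, offset))
                offset += s.size
            self._buffer = bytes(buffer)      # private copy of the memory
            self._cache = [None] * len(self._fields)

        def __len__(self):
            return len(self._fields)

        def __getitem__(self, index):         # the sq_item slot in C
            # Unpack a field on first access only; all values are
            # immutable, so handing out the cached object is safe.
            if self._cache[index] is None:
                s, offset = self._fields[index]
                (self._cache[index],) = s.unpack_from(self._buffer, offset)
            return self._cache[index]

    foo = LazyUnpackResult('2s2s', b'abcd')
    assert foo[0] == b'ab'                    # b'ab' is created here, lazily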

I understand that this may look like a single-case optimization. We
can however assume that most people will benefit from the new behavior
unknowingly, while everyone else takes no harm: the object mimicking
the otherwise-returned tuple is immutable (so it doesn't suddenly
become subject to garbage collection), and the memory overhead of
holding a reference to the original memory a little longer (reclaimed
once the result becomes unreachable) should be comparable to the
memory used by the unneeded field objects (reclaimed directly after
creation).

I'd like to hear your thoughts and am perfectly willing to provide a
patch if it has a chance of inclusion.


Best regards
Lukas

