etree, gzip, and BytesIO

Kushal Kumaran kushal at locationd.net
Thu Jan 21 10:57:03 EST 2021


On Thu, Jan 21 2021 at 08:22:08 AM, Frank Millman <frank at chagford.com> wrote:
> Hi all
>
> This question is mostly to satisfy my curiosity.
>
> In my app I use xml to represent certain objects, such as form
> definitions and process definitions.
>
> They are stored in a database. I use etree.tostring() when storing
> them and etree.fromstring() when reading them back. They can be quite
> large, so I use gzip to compress them before storing them as a blob.
>
> The sequence of events when reading them back is -
>    - select gzip'd data from database
>    - run gzip.decompress() to convert to a string
>    - run etree.fromstring() to convert to an etree object
>
> I was wondering if I could avoid having the unzipped string in memory,
> and create the etree object directly from the gzip'd data. I came up
> with this -
>
>    - select gzip'd data from database
>    - create a BytesIO object - fd = io.BytesIO(data)
>    - use gzip to open the object - gf = gzip.open(fd)
>    - run etree.parse(gf) to convert to an etree object
>
> It works.
>
> But I don't know what goes on under the hood, so I don't know if this
> achieves anything. If any of the steps involves decompressing the data
> and storing the entire string in memory, I may as well stick to my
> present approach.
>
> Any thoughts?
>

etree.parse will build the complete element tree in memory, so all of
the uncompressed content ends up there regardless of how you supply the
input.  If your question is whether you can avoid holding an extra copy
of the uncompressed data in memory, take a look at the ElementTree code
in
https://github.com/python/cpython/blob/3.9/Lib/xml/etree/ElementTree.py
(linked from the documentation of the library module).  The parse method
appears to read 64k at a time from the underlying stream, so using the
gzip.open stream instead of gzip.decompress should limit the duplicated
data being held in memory to roughly one chunk at a time.
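
As a rough sketch of that streaming approach (the helper name
parse_gzipped_xml and the sample document are made up for illustration,
and the blob would really come from your database):

```python
import gzip
import io
import xml.etree.ElementTree as etree


def parse_gzipped_xml(blob):
    """Parse gzip-compressed XML bytes into an ElementTree.

    The data is decompressed as the parser reads from the stream, so
    the full uncompressed document is never built as a single bytes
    object alongside the tree.
    """
    with gzip.open(io.BytesIO(blob)) as gf:
        return etree.parse(gf)


# Hypothetical stand-in for the blob selected from the database.
blob = gzip.compress(b"<form><field name='a'/><field name='b'/></form>")

tree = parse_gzipped_xml(blob)
root = tree.getroot()
```

The result is the same kind of ElementTree object that etree.parse
returns for an ordinary file, so the rest of the code should not need
to change beyond calling getroot() where it previously used the return
value of etree.fromstring().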

It is possible to use the XMLPullParser or iterparse etree features to
parse XML incrementally without ever holding the entire content in
memory, provided you discard each element once you have finished with
it.  But that will not give you a complete ElementTree object, and might
not be feasible without an entire rewrite of the rest of the code.
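
For comparison, a minimal sketch of the incremental style using
iterparse.  The function count_fields, the "field" tag, and the sample
data are assumptions for illustration only; what you would actually do
with each element depends on your application.

```python
import gzip
import io
import xml.etree.ElementTree as etree


def count_fields(blob, tag="field"):
    """Count elements with the given tag in gzip-compressed XML.

    Each matching element is cleared after it has been handled, which
    drops its attributes, text, and children instead of keeping them
    for the lifetime of the parse.
    """
    count = 0
    with gzip.open(io.BytesIO(blob)) as gf:
        for event, elem in etree.iterparse(gf, events=("end",)):
            if elem.tag == tag:
                count += 1
                elem.clear()
    return count


# Hypothetical stand-in for the blob selected from the database.
blob = gzip.compress(
    b"<form><field name='a'/><field name='b'/><field name='c'/></form>"
)
n = count_fields(blob)
```

Note that the cleared elements are still attached to the root as empty
shells, so this reduces memory use rather than eliminating it, and the
tree that is left behind is not usable as a normal ElementTree.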

-- 
regards,
kushal

