weirdness with list()

Sat Feb 27 19:17:32 EST 2021

I just ran into a surprising (to me) issue with list() on an iterable 
object.

My object represents an MDAT box in an MP4 file: it is the ludicrously 
large data box containing the raw audiovideo data; for a TV episode it 
is often about 2GB and a movie is often 4GB to 6GB. For obvious reasons, 
I do not always want to load that into memory, or even read the data 
part at all when scanning an MP4 file, for example to recite its 
metadata.

So my parser has a "skip" mode where it seeks straight past the data, 
but makes a note of its length in bytes. All good.

That length is presented via the object's __len__ method, because I want 
to know that length later and this is a subclass of a suite of things 
which return their length in bytes this way.

So, to my problem:

I've got a walk method which traverses the hierarchy of boxes in the MP4 
file. Until some minutes ago, it looked like this:

  def walk(self):
    subboxes = list(self)
    yield self, subboxes
    for subbox in subboxes:
      if isinstance(subbox, Box):
        yield from subbox.walk()

somewhat like os.walk does for a file tree.

I noticed that it was stalling, and investigation revealed it was 
stalling at this line:

    subboxes = list(self)

when doing the MDAT box. That box (a) has no subboxes at all and (b) has 
a very large __len__ value.

BUT... It also has a __iter__ value, which like any Box iterates over 
the subboxes. For MDAT that is implemented like this:

    def __iter__(self):
        yield from ()

What I was expecting was pretty much instant construction of an empty 
list. What I was getting was a very time consuming (10 seconds or more) 
construction of an empty list.

I believe that this is because list() tries to preallocate storage. I 
_infer_ from the docs that this is done maybe using 
operator.length_hint, which in turn consults "the actual length of the 
object" (meaning __len__ for me?), then __length_hint__, then defaults 
to 0.

I've changed my walk function like so:

  def walk(self):
    subboxes = []
    for subbox in self:
      subboxes.append(subbox)
    ##subboxes = list(self)

and commented out the former list(self) incantation. This is very fast, 
because it makes an empty list and then appends nothing to it. And for 
your typical movie file this is fine, because there are never _very_ 
many immediate subboxes anyway.

But is there a cleaner way to do this?

I'd like to go back to my former list(self) incantation, and modify the 
MDAT box class to arrange something efficient. Setting __length_hint__ 
didn't help: returning NotImplemeneted or 0 had no effect, because 
presumably __len__ was consulted first.

Any suggestions? My current approach feels rather hacky.

I'm already leaning towards making __len__ return the number of subboxes 
to match the iterator, especially as on reflection not all my subclasses 
are consistent about __len__ meaning the length of their binary form; 
I'm probably going to have to fix that - some subclasses are actually 
namedtuples where __len__ would be the field count. Ugh.

Still, thoughts? I'm interested in any approaches that would have let me 
make list() fast while keeping __len__==binary_length.

I'm accepting that __len__ != len(__iter__) is a bad idea now, though.

Cheers,
Cameron Simpson <cs at cskk.id.au>