Fatal bug in gzip.py

Tim Peters tim_one at email.msn.com
Sun May 30 18:18:51 EDT 1999


[A.M. Kuchling]
> ...
> gzip.py contains loops like this:
>             try:
>                 while 1:
>                     self._read(readsize)
>                     readsize = readsize * 2
>             except EOFError:
>                 size = self.extrasize
>
> The intention is to progressively increase the amount of data that's
> read at each iteration, to avoid many iterations when reading a large
> file.  However, when a file has so many members, it will have to go
> through this loop 378 times, but 378 doublings causes an
> OverflowError.  The fix is to increase readsize more slowly, but I'm
> not sure what the right rate of increase is.  Perhaps simply adding
> 1024 bytes at a time will do; does anyone have another suggestion?

Sure!

1) Abstract the readsize-bumping logic, replacing the inline "* 2", "+
1024", etc with calls to a new module-level function:

    readsize = _bump_readsize(readsize)

2) Now that you're no longer paralyzed by the prospect of needing to track
down and change scattered instances by hand repeatedly <wink>, experiment.
One thing I'd try:

def _bump_readsize(readsize, limit=1024**2):
    assert readsize > 0
    return min(readsize * 2, limit)

That is, doubling seemed like a fine idea, and must be better-behaved than
+1024 for the usual case with a single large member.  OTOH, looks like
making readsize "too big" will cause lots of extra work, reading excess
stuff then seeking back to read it all over again.

3) So now that you've abstracted and experimented, abstract some more!  How
about a shared ReadSize object, methods of which are called by both "_read"
and "read" so that "read" can adapt its requests to the file structure
"_read" is discovering empirically?  For example, the shared object and
"_read" can conspire to remember the size of the last member read, and
influence "read" to keep its requests near that size so long as members keep
popping up with that size.

IOW, "read" is flying blind now; "_read" is the only guy who knows what the
file is turning out to look like.  _read would probably like its first
new-member call to match the size the of previous member, and double
thereafter (up to a limit) until the new member is fully read.  Change those
event descriptions into method names.  Rinse.  Repeat.

just-keep-it-out-of-your-eyes-ly y'rs  - tim






More information about the Python-list mailing list