Python vs. Java gzip performance

Tue Mar 21 18:47:42 EST 2006

Caleb Hattingh wrote:
> What does ".readlines()" do differently that makes it so much slower
> than ".read().splitlines(True)"?  To me, the "one obvious way to do it"
> is ".readlines()".

readlines reads at most 100 bytes at a time. I'm not sure why it
does that (probably so that it does not read further ahead than
necessary to get a line (*)), but for gzip, that is terribly
inefficient. I believe the gzip algorithms use a window size much
larger than that; I'm not sure how the gzip library deals with small
reads.

One interpretation would be that gzip decompresses the current block
over and over again if the caller only requests 100 bytes each time.
This is a pure guess - you would need to read the zlib source code
to find out.
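
For reference, here is a rough sketch of what "small reads" look like
at the level of Python's zlib module. It does not settle the guess
above about what the gzip module does internally; it only shows the
streaming interface in question. The function name and the step size
of 100 are illustrative assumptions, not taken from the gzip source.

```python
import gzip
import zlib


def decompress_in_small_steps(blob, step=100):
    """Decompress gzip data, asking zlib for at most `step` bytes per call."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    out = []
    data = blob
    while data:
        # Return at most `step` bytes of output; input that was not
        # consumed because of the limit is kept in unconsumed_tail.
        out.append(d.decompress(data, step))
        data = d.unconsumed_tail
    # Collect any output still buffered inside the decompressor.
    out.append(d.flush())
    return b"".join(out)


sample = b"".join(b"line %d\n" % i for i in range(2000))
assert decompress_in_small_steps(gzip.compress(sample)) == sample
```

Each call hands the not-yet-consumed input back to zlib, so the cost
of many small calls is at least the per-call overhead multiplied by
the number of calls.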

Anyway, decompressing the entire file at once lets zlib operate at the
highest efficiency.
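
A minimal sketch of the two approaches being compared, assuming a
gzip file whose lines end in "\n" only (bytes.splitlines also splits
on "\r", so the results can differ for other data). The helper names
and the sample file are hypothetical. No timings are shown, since
they depend on the Python version and the data.

```python
import gzip
import os
import tempfile


def make_sample(path, n=1000):
    """Write n short lines to a gzip file at `path`."""
    with gzip.open(path, "wb") as f:
        for i in range(n):
            f.write(b"line %d\n" % i)


def lines_via_readlines(path):
    """Let the file object split the lines itself."""
    with gzip.open(path, "rb") as f:
        return f.readlines()


def lines_via_read(path):
    """Decompress everything in one call, then split, keeping line ends."""
    with gzip.open(path, "rb") as f:
        return f.read().splitlines(True)


sample_path = os.path.join(tempfile.mkdtemp(), "sample.gz")
make_sample(sample_path)
assert lines_via_readlines(sample_path) == lines_via_read(sample_path)
```

To compare speed on your own data, wrap each function in
timeit.timeit with the same path.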

Regards,
Martin

(*) Guessing further, it might be that "read a lot" fails to work well 
on a socket, as you would have to wait for the complete data before
even returning the first line.
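
To illustrate the trade-off in this footnote, here is a sketch of a
reader that pulls small chunks from a stream and hands out each line
as soon as it is complete, instead of waiting for all the data. The
name iter_lines and the chunk size are assumptions for illustration;
this is not how readlines is actually implemented.

```python
import io


def iter_lines(stream, chunk_size=100):
    """Yield lines from `stream`, reading at most chunk_size bytes per call."""
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        # Everything before the last "\n" is a complete line; the
        # remainder stays in the buffer until more data arrives.
        *complete, buf = buf.split(b"\n")
        for line in complete:
            yield line + b"\n"
    if buf:
        # Final line without a trailing newline.
        yield buf


first = next(iter_lines(io.BytesIO(b"first\nsecond\nthird\n"), chunk_size=8))
assert first == b"first\n"
```

With a slow source such as a socket, the first line is available
after the first chunk or two; the price is one read call per chunk.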

P.S. Contributions to improve this are welcome.