[Patches] [ python-Patches-1675951 ] [gzip] Performance for small reads and fix seek problem

Fri Mar 16 10:06:46 CET 2007

Patches item #1675951, was opened at 2007-03-07 18:57
Message generated for change (Comment added) made by lucas_malor
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1675951&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Modules
Group: Python 2.6
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Florian Festi (florianfesti)
Assigned to: Nobody/Anonymous (nobody)
Summary: [gzip] Performance for small reads and fix seek problem

Initial Comment:
This patch improves read performance of the gzip module. We have seen improvments from 20% from reading big blocks up to factor 50 for reading 4 byte blocks. Additionally this patch removes the need of seeking within the file with allows using streams like stdin as input.

Details:

The whole read(),readline() stack got rewritten. Core part is a new input buffer. It consists of a list of strings (self.buffer), an offset of what has already be consumed from the first string (self.pos) and the length of the (still consumable) buffer (self.bufferlen). This makes adding to and removing from the buffer cheap. It turned out that removing from the old buffer was breaking performance as for reading one byte the whole buffer had to be copied. For reading a 2k buffer in single bytes 2MB had to be copied.

readline no longer uses read() but directly works on the buffer. This removes a whole layer of copying strings together.

For removing the need of seeking a new readonly filelike class is used (PaddedFile). It just prepends a string to a file and uses the file's read method when the string got used up.

There is probably still some space for tweaking when it comes to buffere sizes as we kept this simple. But the great performance improvments show that we can't be off that much.

Performance test program and results are attached.

----------------------------------------------------------------------

Comment By: Lucas Malor (lucas_malor)
Date: 2007-03-16 10:06

Message:
Logged In: YES 
user_id=1403274
Originator: NO

Indeed the problem is seek() is not implementes for urllib objects. It's
not a bug of your patch, sorry if I explained it bad.

----------------------------------------------------------------------

Comment By: Florian Festi (florianfesti)
Date: 2007-03-16 09:27

Message:
Logged In: YES 
user_id=1736372
Originator: YES

There is a simple solution for that problem: DON'T!

If you pass a file descriptor to the GzipFile class it is your
responsibility that the file is a gzip file and that the file pointer is at
the right position. After having a short look into original code I don't
think it can cope with the use case you brought up.

I'd even argue that seeking to the beginning of the file is broken
behaviour. What if the gzip content doesn't start at the beginning of the
file? In fact prepending some header before compressed data is quite
common. If the file you where reading really had a header your example had
just worked.

----------------------------------------------------------------------

Comment By: Lucas Malor (lucas_malor)
Date: 2007-03-15 19:30

Message:
Logged In: YES 
user_id=1403274
Originator: NO

I applied the patch by hand, I think the problem is simply it's for python
2.6 (I have the stable 2.5 version)

Anyway like I wrote for an old similar patch of another user, the patch
starts to read the header at the current position, and not at the start of
the file. You can see it trying this piece of code:

---------------------------------------
import urllib2
import array
import gzip

urlfile = urllib2.urlopen(someurl)
header = array.array("B")
header.fromstring(urlfile.read(1))

gzfile = gzip.GzipFile(fileobj=urlfile)
print gzfile.read()
-----------------------------------------

Error:
------------------------------------------------------------------------------
  File "c:\My Programs\Python\lib\gzip.py", line 285, in read
    self._read_gzip_header(self.fileobj)
  File "c:\My Programs\Python\lib\gzip.py", line 177, in
_read_gzip_header
    raise IOError, 'Not a gzipped file'
IOError: Not a gzipped file
>Exit code: 1
------------------------------------------------------------------------------

I don't know how you can solve this without seek()

Anyway if you are interested I created the diff for Python 2.5 :

http://digilander.libero.it/LucasMalor/gzip_2.5.py.diff.gz

----------------------------------------------------------------------

Comment By: Florian Festi (florianfesti)
Date: 2007-03-15 18:43

Message:
Logged In: YES 
user_id=1736372
Originator: YES

Hm, works one Linux.
Try this one
File Added: gzip.py.diff

----------------------------------------------------------------------

Comment By: Lucas Malor (lucas_malor)
Date: 2007-03-15 17:42

Message:
Logged In: YES 
user_id=1403274
Originator: NO

Excuse me, but I can't apply the patch. I have Windows XP without any SP
and I tried to do the command

patch -u gzip.py gzip.py.diff

----------------------------------------------------------------------

Comment By: Florian Festi (florianfesti)
Date: 2007-03-09 10:24

Message:
Logged In: YES 
user_id=1736372
Originator: YES

I added checks to test_readline() and test_append(). I left the other read
test untouched to keep some filename=filename  coverage.

BTW: I really forgot special thanks for Phil Knirsch who wrote the initial
read() implementation and the performance test cases and did a lot of
weaking and testing and without whom this patch would never have existed.
File Added: test_gzip.py-noseek.diff

----------------------------------------------------------------------

Comment By: Georg Brandl (gbrandl)
Date: 2007-03-08 21:58

Message:
Logged In: YES 
user_id=849994
Originator: NO

The patch looks good, and the tests pass. Can you add a test that ensures
that a seek() method is not necessary any more for the underlying stream?

(closed #914340 which provided a similar patch as duplicate)

----------------------------------------------------------------------

Comment By: Florian Festi (florianfesti)
Date: 2007-03-07 19:24

Message:
Logged In: YES 
user_id=1736372
Originator: YES

Added minor improvement of the unittest: Check for newlines in the
readline() test code.
File Added: test_gzip.py.diff

----------------------------------------------------------------------

Comment By: Florian Festi (florianfesti)
Date: 2007-03-07 19:01

Message:
Logged In: YES 
user_id=1736372
Originator: YES

File Added: results.txt

----------------------------------------------------------------------

Comment By: Florian Festi (florianfesti)
Date: 2007-03-07 18:58

Message:
Logged In: YES 
user_id=1736372
Originator: YES

File Added: test_gzip2.py

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1675951&group_id=5470