This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: gzip.py and files > 2G
Type:
Stage:
Components: Library (Lib)
Versions: Python 2.3
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: tim.peters
Nosy List: geertj, tim.peters
Priority: normal
Keywords: patch

Created on 2002-10-03 16:16 by geertj, last changed 2022-04-10 16:05 by admin. This issue is now closed.

Files
File name Uploaded Description
python-gzip.diff geertj, 2002-10-04 07:36
Messages (6)
msg41318 - Author: Geert Jansen (geertj) * Date: 2002-10-03 16:16
Problem:

Currently, the gzip module cannot handle files whose
uncompressed size exceeds 2G. The source of the problem is
the trailer at the end of a .gz file, which contains a
32-bit length field. This field cannot represent a file
length > 4G. Because of mixed-type arithmetic in gzip.py,
the effective limit is lowered to 2G.
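
To make the trailer layout concrete, here is a minimal sketch that reads the last 8 bytes of a single-member gzip stream. The helper name `read_gzip_trailer` is hypothetical, and the code assumes a modern Python 3 interpreter rather than the Python 2.3 gzip.py discussed in this report.

```python
import gzip
import io
import struct
import zlib


def read_gzip_trailer(gz_bytes):
    """Return (crc32, isize) from the last 8 bytes of a single-member gzip stream.

    Both fields are little-endian unsigned 32-bit integers; isize is the
    uncompressed length modulo 2**32.
    """
    crc32, isize = struct.unpack("<II", gz_bytes[-8:])
    return crc32, isize


# Build a small gzip stream in memory so the example is self-contained.
payload = b"hello gzip" * 1000
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(payload)

crc32, isize = read_gzip_trailer(buf.getvalue())
print(crc32 == zlib.crc32(payload) & 0xFFFFFFFF, isize == len(payload))
```

Because the length field is only 32 bits wide, any payload of 4G or more cannot be stored exactly; only its low 32 bits survive.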

Testcase:

python gzip.py <file> # must be > 2G
python gzip.py -d <file.gz> # error

Proposed fix:

Test the uncompressed data size modulo 4G. A patch 
implementing this fix is attached. This is also the 
solution that gzip itself uses.
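
A rough sketch of the proposed check follows. The function name `isize_matches` and the constant `MASK32` are illustrative only and are not taken from the attached patch; the idea is simply to compare both lengths modulo 2**32.

```python
MASK32 = 0xFFFFFFFF  # 2**32 - 1; masking with it is the same as "modulo 4G"


def isize_matches(stored_isize, bytes_decompressed):
    """Compare the trailer's 32-bit length with the real length modulo 2**32."""
    return (stored_isize & MASK32) == (bytes_decompressed & MASK32)


# A 5 GiB file stores only the low 32 bits of its length in the trailer.
five_gib = 5 * 2**30
stored = five_gib % 2**32  # what a gzip writer would put in the length field

print(isize_matches(stored, five_gib))      # lengths agree modulo 2**32
print(isize_matches(stored, five_gib + 1))  # off by one byte is still caught
```

The check is weaker than an exact comparison, since lengths that differ by a multiple of 4G are indistinguishable, but the CRC still guards the content.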

Two other remarks:

I don't understand lines 22-23 of gzip.py: why is the 
test: "if value < 0" necessary when writing an unsigned 
int?

The testing of the crc value in GzipFile._read_eof() is
done modulo 4G. Is this necessary? crc32 is just read
from the file as a normal int, and self.crc comes from
zlib.crc32, which always returns a regular int.
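
One possible reason for comparing 32-bit values modulo 4G is that the same four bytes can be interpreted as either a signed or an unsigned integer. The sketch below only illustrates that general effect; the helper `as_unsigned32` is hypothetical, and this is not a claim about what gzip.py did at the time.

```python
import struct


def as_unsigned32(value):
    """Map a signed or unsigned 32-bit value to the range 0 .. 2**32 - 1."""
    return value & 0xFFFFFFFF


raw = b"\xff\xff\xff\xff"
signed = struct.unpack("<i", raw)[0]    # same bytes read as a signed int
unsigned = struct.unpack("<I", raw)[0]  # same bytes read as an unsigned int

# A direct comparison fails even though the underlying bytes are identical;
# masking both sides makes the comparison independent of signedness.
print(signed == unsigned)
print(as_unsigned32(signed) == as_unsigned32(unsigned))
```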

Regards,
Geert Jansen
msg41319 - Author: Geert Jansen (geertj) * Date: 2002-10-04 07:36
Sorry -- it seems the file upload went wrong! Second try.
msg41320 - Author: Tim Peters (tim.peters) * (Python committer) Date: 2002-11-04 17:08
Assigned to me.  I think your suggested fix makes good 
sense.
msg41321 - Author: Tim Peters (tim.peters) * (Python committer) Date: 2002-11-04 19:51
Fixed, by related changes in

Lib/gzip.py; new revision: 1.36
Misc/NEWS; new revision: 1.508
msg41322 - Author: Geert Jansen (geertj) * Date: 2002-11-05 10:36
I'm afraid this doesn't fix the whole problem.

You fixed the problem for file sizes in the range 2G-4G, but (if 
I read your patch correctly), files >4G still don't work. On 
Linux it is very easy to create files > 4G and Python supports 
this, so it would be nice to have.

A better fix IMHO would be to test the file size modulo 4G.
The probability that an invalid gzip file becomes valid under
this less accurate test is astronomically small (there is also
a CRC). In fact, this is also the fix that the "official" gzip
program uses.

I can give you a test account on my Linux machine if you
want to test a patch and don't have a machine with large file
support nearby. Or I can test a patch for you.
msg41323 - Author: Tim Peters (tim.peters) * (Python committer) Date: 2002-11-05 20:40
Got it.  It's distasteful but pragmatic <wink>.  Fixed again, in

Lib/gzip.py; new revision: 1.37
Misc/NEWS; new revision: 1.510

It was tested "by hand" on Win2K (on a 6+GB file).
History
Date User Action Args
2022-04-10 16:05:43 admin set github: 37257
2002-10-03 16:16:48 geertj create