Add a file to a compressed tarfile

Sat Nov 6 05:39:18 EST 2004

On Friday, 5 November 2004 19:19, Josiah Carlson wrote:
> I am not aware of any such method.  I am fairly certain gzip (and the
> associated zlib) does the following:
>
> while bytes remaining:
>     reset/initialize state
>     while state is not crappy and bytes remaining:
>         compress portion of remaining bytes
>         update state
>
> Even if one could discover the last reset/initialization of state, one
> would still need to decompress the data from then on in order to
> discover the two empty blocks.

This is not entirely true. There is a full flush, which IIRC is done every n 
bytes (n > 100000 bytes) and can also be forced by the programmer. If you do 
a full flush, the output written so far is complete in itself up to the point 
of the flush.

From the documentation:

"""flush([mode])

All pending input is processed, and a string containing the remaining 
compressed output is returned. mode can be selected from the constants 
Z_SYNC_FLUSH, Z_FULL_FLUSH, or Z_FINISH, defaulting to Z_FINISH. Z_SYNC_FLUSH 
and Z_FULL_FLUSH allow compressing further strings of data and are used to 
allow partial error recovery on decompression, while Z_FINISH finishes the 
compressed stream and prevents compressing any more data. After calling 
flush() with mode set to Z_FINISH, the compress() method cannot be called 
again; the only realistic action is to delete the object."""
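
To make the three flush modes concrete, here is a minimal sketch. It assumes 
Python 3 (so the data are bytes, unlike the Python 2 session in this mail), 
and the names data, part1, part2 and part3 are made up for illustration. It 
only shows that compression can continue after Z_SYNC_FLUSH and Z_FULL_FLUSH, 
and that everything written up to a flush point can already be decompressed:

```python
import zlib

data = b"hahahahahaha" * 20

comp = zlib.compressobj(6)
# Each flush returns all pending output; only Z_FINISH ends the stream.
part1 = comp.compress(data) + comp.flush(zlib.Z_SYNC_FLUSH)
part2 = comp.compress(data) + comp.flush(zlib.Z_FULL_FLUSH)
part3 = comp.compress(data) + comp.flush(zlib.Z_FINISH)

# Sync and full flushes end on a byte boundary with an empty stored
# block, whose last four bytes are 00 00 ff ff.
sync_marker = b"\x00\x00\xff\xff"
ends_with_marker = part1.endswith(sync_marker) and part2.endswith(sync_marker)

# A decompressor can recover everything written up to a flush point
# without having seen the rest of the stream.
dec = zlib.decompressobj()
recovered_so_far = dec.decompress(part1 + part2)
recovered_rest = dec.decompress(part3)

print(ends_with_marker, len(recovered_so_far), len(recovered_rest), dec.eof)
```

The difference between the two non-final modes is that Z_FULL_FLUSH also 
resets the compression state, so data written after it does not refer back 
to anything written before it.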

Anyway, the state is reset to the initial state after the full flush, so that 
the next block of data is independent of the block that was flushed. So, 
you might start writing after the full flush, but you'd have to make sure 
that the compressed stream has the same format specification as the one 
previously written (see the compression level parameter of 
compress/decompress), and you'd also have to make sure that the header of 
the appended stream is suppressed, and that the final block written by 
Z_FINISH correctly reflects the data that was appended (because you 
basically overwrite the finish block of the first compress).

Little example:

>>> import zlib
>>> x = zlib.compressobj(6)
>>> x
<zlib.Compress object at 0xb7e39de0>
>>> a = x.compress("hahahahahaha"*20)
>>> a += x.flush(zlib.Z_FULL_FLUSH)
>>> a
'x\x9c\xcaH\xcc\x18Q\x10\x00\x00\x00\xff\xff'
>>> b = x.flush(zlib.Z_FINISH)
>>> b
'\x03\x00^\x84^9'
>>> x = zlib.compressobj(6) # New compression object with same compression.
>>> c = x.compress("hahahahahaha"*20)
>>> c += x.flush(zlib.Z_FULL_FLUSH)
>>> c
'x\x9c\xcaH\xcc\x18Q\x10\x00\x00\x00\xff\xff'
>>> d = x.flush(zlib.Z_FINISH)
>>> d
'\x03\x00^\x84^9'
>>> e = a+c[2:] # Strip header of second block.
>>> x = zlib.decompressobj()
>>> f = x.decompress(e)
>>> len(f)
480 # Two times 240 = 480.
>>> f
'haha...' # Rest stripped for clarity.

So, as far as this goes, it works. But:

>>> x = zlib.decompressobj()
>>> e = a+c[2:]+d
>>> f = x.decompress(e)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
zlib.error: Error -3 while decompressing: incorrect data check

You see here that if you append the end-of-stream block of the second 
stream (which is written by x.flush(zlib.Z_FINISH)), the data check fails: 
the checksum is always written for the entire data of one compression 
object, so it covers only the 240 bytes of the second stream, while the 
decompressor checks it against all 480 bytes. Leaving out the end-of-stream 
block doesn't cause decompression to fail.
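
One way around the broken checksum, as a minimal sketch and not a tested 
recipe for existing files: write the appended part as a raw deflate stream 
(negative wbits, so no header is produced) and then write an Adler-32 
trailer computed over both parts. This assumes Python 3 and that the first 
part was never finished, i.e. it ends at a full flush; the names first, 
second, head and tail are made up for illustration.

```python
import struct
import zlib

first = b"hahahahahaha" * 20
second = b"hohohohohoho" * 20

# First part: zlib header plus deflate data, ended by a full flush
# instead of Z_FINISH, so the stream is still open.
c1 = zlib.compressobj(6)
head = c1.compress(first) + c1.flush(zlib.Z_FULL_FLUSH)

# Second part: raw deflate (negative wbits means no header and no
# trailer), finished so that it contains the final block.
c2 = zlib.compressobj(6, zlib.DEFLATED, wbits=-zlib.MAX_WBITS)
tail = c2.compress(second) + c2.flush(zlib.Z_FINISH)

# The zlib trailer is the Adler-32 of ALL uncompressed data, stored
# big-endian, so continue the running checksum across both parts.
checksum = zlib.adler32(second, zlib.adler32(first)) & 0xFFFFFFFF
stream = head + tail + struct.pack(">I", checksum)

result = zlib.decompress(stream)
print(len(result))
```

Here the whole spliced stream, including its end-of-stream block, passes the 
data check, because the trailer now matches the combined data.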

I know too little about the internal format of a gzip file (which adds more 
header data, but otherwise is just a zlib compressed stream) to tell whether 
an approach such as this one would also work on gzip files, but I presume it 
should.
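
For what it's worth, a minimal sketch of the same idea with the gzip 
container, assuming Python 3 and the layout from RFC 1952 (header, deflate 
data, then CRC-32 and uncompressed size, both little-endian). As before, the 
first part is assumed to end at a full flush rather than at Z_FINISH, and 
the names first, second, head and tail are made up for illustration. It does 
not show how to find the final block in an already finished gzip file.

```python
import gzip
import struct
import zlib

first = b"hahahahahaha" * 20
second = b"hohohohohoho" * 20

# wbits = 16 + MAX_WBITS makes zlib write a gzip header instead of
# the two-byte zlib header.
c1 = zlib.compressobj(6, zlib.DEFLATED, wbits=16 + zlib.MAX_WBITS)
head = c1.compress(first) + c1.flush(zlib.Z_FULL_FLUSH)

# Appended data as raw deflate: no header, no trailer.
c2 = zlib.compressobj(6, zlib.DEFLATED, wbits=-zlib.MAX_WBITS)
tail = c2.compress(second) + c2.flush(zlib.Z_FINISH)

# gzip trailer: CRC-32 of all uncompressed data, then its length
# modulo 2**32, both little-endian.
crc = zlib.crc32(second, zlib.crc32(first)) & 0xFFFFFFFF
size = (len(first) + len(second)) & 0xFFFFFFFF
member = head + tail + struct.pack("<II", crc, size)

result = gzip.decompress(member)
print(len(result))
```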

Hope this little explanation helps!

Heiko.