Add a file to a compressed tarfile

Josiah Carlson jcarlson at uci.edu
Sat Nov 6 12:22:16 EST 2004


Heiko Wundram <heikowu at ceosg.de> wrote:
> 
> On Friday, 5 November 2004 19:19, Josiah Carlson wrote:
> > I am not aware of any such method.  I am fairly certain gzip (and the
> > associated zlib) does the following:
> >
> > while bytes remaining:
> >     reset/initialize state
> >     while state is not crappy and bytes remaining:
> >         compress portion of remaining bytes
> >         update state
> >
> > Even if one could discover the last reset/initialization of state, one
> > would still need to decompress the data from then on in order to
> > discover the two empty blocks.
> 
> This is not entirely true... There is a full flush which is done every n bytes 
> (n > 100000 bytes, IIRC), and can also be forced by the programmer. In case 
> you do a full flush, the block which you read is complete as is up till the 
> point you did the flush.

[snip explanation]

Thank you for the great information!

So it seems that one would still need to do the following in order to
get tgz appending done:

1. Find the last compressed section of the tar file.
2. Roll the checksum back (CRC32 is easy to invert) to the end of the
usable tarfile.
3. Note the size recorded in the gzip footer and adjust it to match.
4. Seek to the end of the usable tarfile.
5. Write a Z_FULL_FLUSH to start on a new block.
6. Write the new compressed data, and make sure you keep track of the
checksum (either by injecting it into zlib and/or gzip some way, or
manually computing it).
7. Write a Z_FINISH, then update/write the checksum and size trailers.
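
As a rough illustration of steps 3 and 5-7 only, here is a minimal sketch
using Python's zlib module. It assumes the easy case: the existing gzip
member was left at a Z_FULL_FLUSH point with no final block or trailer
written yet, and its running CRC32 and length are already known. It does
not attempt steps 1-2 (finding the last block and inverting the CRC over
the tar end-of-archive blocks). The names start_member, append_and_finish
and read_trailer are made up for this example and are not part of any
library.

```python
import gzip
import struct
import zlib

# Minimal 10-byte gzip header: magic, method=deflate, no flags,
# mtime=0, no extra flags, OS=unknown.
GZIP_HEADER = b"\x1f\x8b\x08\x00" + b"\x00" * 4 + b"\x00\xff"


def start_member(data):
    """Compress data as raw deflate and stop at a full-flush point.

    Returns (bytes_so_far, crc, size). No final block and no trailer
    are written, so the member is still "open" for appending.
    """
    comp = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
    body = comp.compress(data) + comp.flush(zlib.Z_FULL_FLUSH)
    return GZIP_HEADER + body, zlib.crc32(data), len(data)


def append_and_finish(prefix, crc, size, more):
    """Append more data after the flush point and close the member."""
    # A fresh raw-deflate stream can follow a full flush directly:
    # the output is byte aligned and no history is carried over.
    comp = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
    body = comp.compress(more) + comp.flush(zlib.Z_FINISH)
    # Continue the running checksum instead of recomputing it.
    crc = zlib.crc32(more, crc)
    # The gzip size field holds the uncompressed length modulo 2**32.
    size = (size + len(more)) & 0xFFFFFFFF
    return prefix + body + struct.pack("<II", crc, size)


def read_trailer(blob):
    """Return (crc32, size) from the last 8 bytes of a gzip member."""
    return struct.unpack("<II", blob[-8:])


first = b"hello " * 1000
second = b"world" * 1000
prefix, crc, size = start_member(first)
whole = append_and_finish(prefix, crc, size, second)
# gzip.decompress verifies both the CRC32 and the size trailer.
assert gzip.decompress(whole) == first + second
```

For a real .tar.gz the hard part remains locating the flush point and
undoing the CRC contribution of the trailing tar blocks; the sketch only
shows that, once those are known, the rest is mechanical.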


All in all, it doesn't look too hard.  I think such a thing could be
done in an afternoon, and would be a truly nifty addition to the Python
standard library.

BZip2, on the other hand, looks nice because of its block structure,
but each block is Huffman coded, so it may not be possible to discover
the final block very easily (also, the file format isn't leaping out at
me from the BZip2 docs).

 - Josiah



