ZipFile - file adding API incomplete?

Tue Nov 17 23:10:51 EST 2009

On Tue, Nov 17, 2009 at 9:28 AM, Dave Angel <davea at ieee.org> wrote:
> I'm pretty sure that the ZIP format uses independent compression for each
> contained file (member).  You can add and remove members from an existing
> ZIP, and use several different compression methods within the same file.  So
> the adaptive tables start over for each new member.

This is correct.  It doesn't do solid compression, which is what you
get with .tar.gz (and RARs, optionally).
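
To illustrate the per-member compression, here is a minimal sketch using the standard zipfile module (it assumes a Python new enough that writestr() accepts compress_type, and that zlib is available). The file names and payload are made up for the example.

```python
import io
import zipfile

payload = b"spam and eggs " * 200

# Write two members into one archive, each with its own compression method.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("stored.txt", payload, compress_type=zipfile.ZIP_STORED)
    zf.writestr("deflated.txt", payload, compress_type=zipfile.ZIP_DEFLATED)

# Read the central directory back and collect the method and size per member.
with zipfile.ZipFile(buf) as zf:
    methods = {info.filename: info.compress_type for info in zf.infolist()}
    sizes = {info.filename: info.compress_size for info in zf.infolist()}
```

Each member records its own method, so the stored copy keeps its full size while the deflated copy shrinks independently of it.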

> What isn't so convenient is that the sizes are apparently at the end.  So if
> you're trying to unzip "over the wire" you can't readily do it without
> somehow seeking to the end.  That same feature is a good thing when it comes
> to spanning zip files across multiple disks.

Actually, there are two copies of the headers: one immediately before
the file data (the local file header), and one at the end (the central
directory); both contain copies of the compressed and uncompressed
file size.  Very few programs actually use the local file headers, but
it's very nice to have the option.  It also helps make ZIPs very
recoverable.  If you've ever run a ZIP recovery tool, they're usually
just reconstructing the central directory from the local file headers
(and probably recomputing the CRCs).
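
As a rough sketch of what such a tool reads, the fixed 30-byte part of a local file header can be unpacked with struct. The helper name read_local_header and the dict it returns are my own inventions for this example; the archive is built in memory so the snippet is self-contained.

```python
import io
import struct
import zipfile

# Fixed part of a local file header: signature, version, flags, method,
# mod time, mod date, CRC-32, compressed size, uncompressed size,
# file name length, extra field length (30 bytes, little-endian).
LOCAL_HEADER = struct.Struct("<4s5H3L2H")

def read_local_header(fp, offset):
    """Parse the local file header at `offset` (hypothetical helper)."""
    fp.seek(offset)
    (sig, version, flags, method, mtime, mdate,
     crc, csize, usize, name_len, extra_len) = LOCAL_HEADER.unpack(
        fp.read(LOCAL_HEADER.size))
    if sig != b"PK\x03\x04":
        raise ValueError("no local file header at offset %d" % offset)
    name = fp.read(name_len)
    return {
        "name": name,
        "flags": flags,
        "crc": crc,
        "compressed": csize,
        "uncompressed": usize,
        "data_offset": offset + LOCAL_HEADER.size + name_len + extra_len,
    }

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", b"hello " * 100)

# Compare the local header against the central directory entry.
with zipfile.ZipFile(buf) as zf:
    info = zf.getinfo("a.txt")
local = read_local_header(buf, info.header_offset)
```

For an archive written to a seekable file, the local header carries the same CRC and sizes as the central directory, which is what makes rebuilding the directory possible.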

(This is no longer true if bit 3 of the bitflags is set, which puts
the CRC and filesizes after the data.  In that case, it's not possible
to stream data--largely defeating the benefit of the local headers.)
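
A quick way to see bit 3 in practice, sketched with the standard zipfile module on a Python version that supports writing to unseekable streams: the writer sets the flag when it can't seek back to patch the local header. WriteOnly and has_data_descriptor are names made up for this example.

```python
import io
import zipfile

class WriteOnly:
    """Minimal unseekable sink: no tell() or seek(), like a socket or pipe."""
    def __init__(self):
        self.chunks = []

    def write(self, data):
        self.chunks.append(bytes(data))
        return len(data)

    def flush(self):
        pass

def has_data_descriptor(info):
    # Bit 3 of the general-purpose flags: CRC and sizes follow the data.
    return bool(info.flag_bits & 0x08)

# Seekable target: the writer can go back and fill in the local header.
seekable = io.BytesIO()
with zipfile.ZipFile(seekable, "w") as zf:
    zf.writestr("a.txt", b"hello")

# Unseekable target: the writer has to append a data descriptor instead.
sink = WriteOnly()
with zipfile.ZipFile(sink, "w") as zf:
    zf.writestr("a.txt", b"hello")
streamed = io.BytesIO(b"".join(sink.chunks))

with zipfile.ZipFile(seekable) as zf:
    seekable_flag = has_data_descriptor(zf.getinfo("a.txt"))
with zipfile.ZipFile(streamed) as zf:
    streamed_flag = has_data_descriptor(zf.getinfo("a.txt"))
```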

> Define a call to read _portions_ of the raw (compressed, encrypted, whatever) data.

I think the clean way is to return a file-like object for a specified file, eg.:

# Read raw bytes 1024-1151 from each file in the ZIP
# (rawopen() is the proposed method; it doesn't exist yet):
from zipfile import ZipFile
zf = ZipFile("file.zip", "r")
for info in zf.infolist():
    f = zf.rawopen(info)  # or a filename
    f.seek(1024)
    data = f.read(128)
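
Until something like that exists, the same bytes can be fetched by hand. The sketch below is only an approximation of the idea: read_raw is a hypothetical helper, not part of zipfile, and it assumes the caller has access to the underlying seekable file object.

```python
import io
import struct
import zipfile
import zlib

def read_raw(fp, info, start=0, size=-1):
    """Return stored (still compressed) bytes of one member.

    Hypothetical helper: skips the local header and reads `size` bytes
    starting `start` bytes into the member's raw data.
    """
    # The name and extra-field lengths sit at offset 26 of the local header.
    fp.seek(info.header_offset + 26)
    name_len, extra_len = struct.unpack("<2H", fp.read(4))
    data_start = info.header_offset + 30 + name_len + extra_len
    remaining = max(info.compress_size - start, 0)
    if size < 0 or size > remaining:
        size = remaining
    fp.seek(data_start + start)
    return fp.read(size)

original = b"raw access example " * 50
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", original)

with zipfile.ZipFile(buf) as zf:
    info = zf.getinfo("a.txt")

raw = read_raw(buf, info)          # the whole compressed stream
head = read_raw(buf, info, 0, 4)   # just the first four raw bytes
```

The raw bytes of a deflated member are a headerless deflate stream, so zlib.decompress(raw, -15) recovers the original data.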

> Define a call that locks the ZipFile object and returns a write handle for a single new file.

I'd use a file-like object here, too, for probably obvious
reasons--you can pass it to anything expecting a file object to write
data to (eg. shutil.copyfile).

> Only on successful close of the "write handle" is the new directory written.

Rather, when the new file is closed, its directory entry is saved to
ZipFile.filelist.  The new directory on disk should be written when
the zip's own close() method is called, just as when writing files
with the other methods.  Otherwise, writing lots of files in this way
would write and overwrite the central directory repeatedly.
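
For comparison, Python 3.6 and later provide ZipFile.open(name, mode="w"), which returns a writable file-like object along these lines. It is not the writefile() proposed here, but it shows the same pattern: the entry only joins the in-memory list once the handle is closed, and the central directory is written when the ZipFile itself is closed.

```python
import io
import shutil
import zipfile

source = io.BytesIO(b"streamed payload " * 1000)
buf = io.BytesIO()

with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    dest = zf.open("streamed.bin", "w")
    # Anything that writes to a file object can fill the member.
    shutil.copyfileobj(source, dest)
    listed_while_open = "streamed.bin" in zf.namelist()
    dest.close()
    listed_after_close = "streamed.bin" in zf.namelist()
# Leaving the with block closes the ZipFile and writes the central directory.

with zipfile.ZipFile(buf) as zf:
    round_trip = zf.read("streamed.bin")
```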

Any thoughts about this rough API outline:

ZipFile.rawopen(zinfo_or_arcname)
Same definition as open(), but returns the raw data.  No mode (no
newline translation for raw files); no pwd (raw files aren't
decrypted).

ZipFile.writefile(zinfo[, raw])
Definition like ZipFile.writestr.  Relax writestr()'s "at least the
filename, date, and time must be given" rule: if not specified, use
the current date and time.  Returns a file-like object (ZipWriteFile)
which file data is written to.  If raw is True, no actual compression
is performed, and the file data should already be compressed with the
specified compression type (no checking is performed).  If raw is
False (the default), the data will be compressed before being written.
 When finished writing data, the file must be closed.  Only one
ZipWriteFile may be open for each ZipFile at a time.  Calls to
ZipFile.writefile while a ZipWriteFile is already open will result in
ValueError[1].
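
The one-handle-at-a-time rule can be seen with ZipFile.open(name, mode="w") in Python 3.6 and later, which behaves this way. This is only an analogy for the proposed writefile(); the member names are made up.

```python
import io
import zipfile

buf = io.BytesIO()
second_open_error = None

with zipfile.ZipFile(buf, "w") as zf:
    first = zf.open("one.txt", "w")
    try:
        # A second write handle while the first is still open is refused.
        zf.open("two.txt", "w")
    except ValueError as exc:
        second_open_error = exc
    first.write(b"1")
    first.close()

    # Once the first handle is closed, the next member can be written.
    with zf.open("two.txt", "w") as second:
        second.write(b"2")

with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
```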

Another detail: is the CRC recomputed when writing in raw mode?  No.
If I delete a file from a ZIP (causing me to rewrite the ZIP) and
another file in the ZIP is corrupt, it should just move the file
as-is, invalid CRC and all; it should not rewrite the file with a new
CRC (masking the corruption) or throw an error (I should not get
errors about file X being corrupt if I'm deleting file Y).  When
writing in raw mode, if zinfo.CRC is already specified (not None), it
should be used as-is.

I don't like how this results in three different APIs for adding data
(write, writestr, writefile), but trying to squeeze the APIs together
feels unnatural--the parameters don't really line up too well.  I'd
expect the other two to become thin wrappers around
ZipFile.writefile().  Unlike ZipFile.write(), writefile() never opens
files on disk itself, so it takes only a zinfo and not a filename (set
the filename through the ZipInfo).

Now you can stream data into a ZIP, specify all metadata for the file,
and you can stream in compressed data from another ZIP (for deleting
files and other cases) without recompressing.  This also means you can
do all of these things to encrypted files without the password, and to
files compressed with unknown methods, which is currently impossible.
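
For contrast, this is roughly what deleting a member looks like with only the existing read()/writestr() calls. copy_without is a hypothetical helper; note that every surviving member is decompressed, CRC-checked, and compressed again, which is exactly the cost (and the password and known-method requirement) that raw copying would avoid.

```python
import io
import zipfile

def copy_without(src, dst, unwanted):
    """Rewrite the archive in `src` to `dst`, skipping one member.

    Hypothetical helper built on the existing API only.
    """
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        for info in zin.infolist():
            if info.filename == unwanted:
                continue
            # read() decompresses and verifies the CRC; writestr()
            # compresses again using the method recorded in info.
            zout.writestr(info, zin.read(info))

src = io.BytesIO()
with zipfile.ZipFile(src, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("keep.txt", b"keep me " * 100)
    zf.writestr("drop.txt", b"drop me " * 100)

dst = io.BytesIO()
copy_without(src, dst, "drop.txt")

with zipfile.ZipFile(dst) as zf:
    names = zf.namelist()
    kept = zf.read("keep.txt")
    kept_method = zf.getinfo("keep.txt").compress_type
```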

> and I realize that the big flaw in this design is that from the moment you
> start overwriting the existing master directory until you write a new master
> at the end, you do not have a valid zip file.

The same is true when appending to a ZIP with ZipFile.write(); until
it finishes, the file on disk isn't a valid ZIP.  That's unavoidable.
Files in the ZIP can still be opened by the existing ZipFile object,
since it keeps the central directory in memory.

For what it's worth, I've written ZIP parsing code several times over
the years (https://svn.stepmania.com/svn/trunk/stepmania/src/RageFileDriverZip.cpp),
so I'm familiar with the more widely-used parts of the file format,
but I haven't dealt with ZIP writing very much.  I'm not sure if I'll
have time to get to this soon, but I'll keep thinking about it.

[1] Seems odd, but mimicking
http://docs.python.org/library/stdtypes.html#file.close

-- 
Glenn Maynard