Algorithm that makes maximum compression of completely diffused data.

Chris Angelico rosuav at gmail.com
Sun Nov 3 00:10:30 EDT 2013


On Sun, Nov 3, 2013 at 2:17 PM, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
> There is a way to apparently get around these limits: store data
> externally, perhaps inside the compression application itself. Then, if
> you just look at the compressed file (the "data.zip" equivalent, although
> I stress that zip compression is *not* like this), you might think it has
> shrunk quite a lot. But when you include the hidden data, the compression
> is not quite so impressive...

Storing externally is, of course, a very useful thing - it's just not
compression. For instance, a git repository (and quite likely a
Mercurial one too) keeps track of files as blobs of data referenced by
their cryptographic hashes. The current state of a directory can be
described by listing file names with their object hashes, which is a
very compact notation; but it doesn't have _all_ the information, and
the "decompression" process involves fetching the file contents back
out of the object store. It's a tool in the arsenal, but it's not
compression in the same way that PK-ZIP is.
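A minimal sketch of the idea in Python (the blob-hashing scheme is
git's actual one, sha1 over a "blob <size>\0" header plus content; the
dict and the store() helper are just hypothetical stand-ins for git's
real object database):

import hashlib

def git_blob_hash(content):
    # git hashes a blob as sha1(b"blob <size>\0" + content)
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# A dict standing in for git's object store (illustration only).
object_store = {}

def store(content):
    h = git_blob_hash(content)
    object_store[h] = content
    return h

# The compact "directory state" is just names mapped to hashes...
tree = {"hello.txt": store(b"hello world\n")}
# ...and "decompression" means fetching the contents back by hash.
assert object_store[tree["hello.txt"]] == b"hello world\n"

The listing itself is tiny, but only because the bulk of the data
lives elsewhere, keyed by hash - which is exactly the "hidden data"
Steven described.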

With real compression algorithms, there's usually an "out" clause that
detects when the algorithm is doing a bad job. The PK-ZIP file format
(and, I expect, every other archiving+compression file format) allows
some of the files to be stored uncompressed. That way, no file needs
to be enlarged by more than some small signature - say, a one-byte
"compression type" marker, or even a one-bit flag somewhere. But
enlarged they must be, for the reasons Steven explained.
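You can watch the out-clause at work by feeding a compressor data
that's already high-entropy. A quick sketch using zlib (the same
deflate compression PK-ZIP-style archivers use; the exact output size
will vary by a few bytes):

import os, zlib

raw = os.urandom(100_000)      # already "completely diffused" data
packed = zlib.compress(raw, 9)

# Deflate can't shrink random bytes, so it falls back to emitting
# them in "stored" blocks - the output ends up slightly *larger*
# than the input (header + per-block overhead + checksum).
print(len(raw), len(packed))   # e.g. 100000 vs roughly 100020
assert len(packed) > len(raw)  # true for essentially all random input

A handful of bytes of overhead on 100 KB is a good deal compared to
the alternative, but it's still an enlargement, not a compression.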

ChrisA


