removing the header from a gzip'd string

Fredrik Lundh fredrik at pythonware.com
Sun Dec 24 06:15:34 EST 2006


debarchana.ghosh at gmail.com wrote:

> Essentially, they note that the NCD does not always behave like a
> metric and one reason they put forward is that this may be due to the
> size of the header portion (they were using the command line gzip and
> bzip2 programs) compared to the strings being compressed (which are on
> average 48 bytes long).

gzip datastreams have a real header, with a file type identifier, 
optional filenames, comments, and a bunch of flags.
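
to see what that header looks like, here's a minimal sketch that pulls 
apart the fixed 10-byte header and the 8-byte trailer, following the 
layout in RFC 1952.  the helper name parse_gzip_header and the sample 
data are made up for illustration, and the sketch ignores the optional 
fields (extra data, filename, comment) that follow the fixed header 
when the corresponding flag bits are set.

```python
import gzip
import struct
import zlib


def parse_gzip_header(blob):
    """Decode the fixed gzip header and trailer (hypothetical helper)."""
    # fixed header: 2-byte magic, method, flags, 4-byte mtime, xfl, os
    magic, method, flags, mtime, xfl, os_id = struct.unpack(
        "<2sBBIBB", blob[:10]
    )
    # trailer: CRC-32 of the uncompressed data, then its length mod 2**32
    crc32, isize = struct.unpack("<II", blob[-8:])
    return {
        "magic": magic,
        "method": method,    # 8 means deflate
        "flags": flags,      # bit 3 (0x08) set means a filename follows
        "mtime": mtime,
        "xfl": xfl,
        "os": os_id,
        "crc32": crc32,
        "isize": isize,
    }


data = b"x" * 48  # assumed sample, not the poster's data
header = parse_gzip_header(gzip.compress(data))
for name, value in header.items():
    print(name, value)
```

the command-line gzip program normally stores the original filename 
as well, so its output carries more overhead than the in-memory call 
used here.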

but even if you strip that off (which is basically what happens if you 
use zlib.compress instead of gzip; the zlib format still adds a two-byte 
header and a four-byte checksum), I doubt you'll get representative 
"compressibility" metrics on strings that short.  like most other 
compression algorithms, gzip and bzip2 are designed for much larger 
datasets.
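
a rough way to check this yourself is to compress one short string 
three ways and compare the sizes.  this is only a sketch: the sample 
string and the raw_deflate helper are made up for illustration, and 
the exact numbers will depend on your data and zlib build.  passing a 
negative wbits value to zlib.compressobj asks for a raw deflate stream 
with no wrapper at all.

```python
import gzip
import zlib


def raw_deflate(data, level=9):
    """Compress to a raw deflate stream (hypothetical helper)."""
    # negative wbits: no zlib or gzip header, no trailing checksum
    comp = zlib.compressobj(level, zlib.DEFLATED, -zlib.MAX_WBITS)
    return comp.compress(data) + comp.flush()


# assumed 48-byte sample, standing in for the strings being compared
sample = b"the quick brown fox jumps over the lazy dog 1234"

sizes = {
    "input": len(sample),
    "raw deflate": len(raw_deflate(sample)),
    "zlib": len(zlib.compress(sample, 9)),
    "gzip": len(gzip.compress(sample, 9)),
}

for name, size in sizes.items():
    print(f"{name:12s} {size:3d} bytes")
```

the wrappers account for the gap between the three compressed sizes, 
but the raw deflate number is the interesting one: on input this small 
it may not be smaller than the input at all, which is why stripping 
the header is unlikely to rescue the metric.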

</F>



