removing the header from a gzip'd string

debarchana.ghosh at gmail.com
Sat Dec 23 12:44:50 EST 2006


Bjoern Schliessmann wrote:
> Rajarshi wrote:
>
> > Does anybody know how I can remove the header portion of the
> > compressed bytes, such that I only have the compressed data
> > remaining? (Obviously I do not intend to perform the
> > decompression!)
>
> Just curious: What's your goal? :) A home made hash function?

Actually, I was implementing the normalized compression distance (NCD)
to evaluate molecular similarity, as described in an article in
J. Chem. Inf. Model. (http://dx.doi.org/10.1021/ci600384z, subscriber
access only, unfortunately).

Essentially, they note that the NCD does not always behave like a
metric. One reason they put forward is the size of the header portion
(they were using the command-line gzip and bzip2 programs) relative to
the strings being compressed, which are on average 48 bytes long.
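
To give a rough sense of how much of the output is container rather than
compressed data, here is a minimal sketch using Python's zlib module. It
relies on the documented wbits argument of zlib.compressobj: a negative
value produces a raw deflate stream with no header or trailer, 15 produces
the zlib container, and 31 produces the gzip container. The helper names
and the sample string are my own illustrations, not taken from the article,
and the command-line gzip can add more (for example, the original file
name), so treat the gzip figure as a lower bound for that tool.

```python
import zlib


def compressed_size(data: bytes, wbits: int, level: int = 9) -> int:
    """Length in bytes of `data` compressed with the given container.

    wbits = -15 -> raw deflate (no header, no trailer)
    wbits =  15 -> zlib container
    wbits =  31 -> gzip container
    """
    comp = zlib.compressobj(level, zlib.DEFLATED, wbits)
    return len(comp.compress(data) + comp.flush())


def container_overhead(data: bytes) -> dict:
    """Bytes added by the zlib and gzip containers on top of raw deflate."""
    raw = compressed_size(data, -15)
    return {
        "raw": raw,
        "zlib_extra": compressed_size(data, 15) - raw,
        "gzip_extra": compressed_size(data, 31) - raw,
    }


# Hypothetical short input (a SMILES-like string), for illustration only.
sample = b"CC(=O)Oc1ccccc1C(=O)O"
print(container_overhead(sample))
```

For inputs of a few dozen bytes, a fixed overhead of this size is a large
fraction of the total, which is consistent with the concern raised above.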

So I was interested to see if the NCD behaved like a metric if I
removed everything that was not the compressed string. And since I only
need to calculate similarity between two strings, I do not need to do
any decompression.
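
A minimal sketch of what I have in mind, assuming the usual definition
NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the
compressed length. Instead of stripping the header after the fact, it asks
zlib for a raw deflate stream directly (negative wbits), so only compressed
data is counted. The function names and the example strings are
hypothetical, and this is not necessarily the procedure used in the article.

```python
import zlib


def raw_deflate_size(data: bytes, level: int = 9) -> int:
    """Length of `data` as a raw deflate stream (no header or trailer)."""
    # Negative wbits tells zlib to omit the container entirely.
    comp = zlib.compressobj(level, zlib.DEFLATED, -15)
    return len(comp.compress(data) + comp.flush())


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance using header-free deflate sizes."""
    cx = raw_deflate_size(x)
    cy = raw_deflate_size(y)
    cxy = raw_deflate_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical SMILES-like strings, for illustration only.
a = b"CC(=O)Oc1ccccc1C(=O)O"
b = b"N[C@@H](Cc1c[nH]c2ccccc12)C(=O)O"

print("NCD(a, a) =", ncd(a, a))
print("NCD(a, b) =", ncd(a, b))
print("NCD(b, a) =", ncd(b, a))
```

Comparing NCD(a, b) with NCD(b, a), and NCD(a, a) with zero, is a quick way
to see how far the result is from satisfying the metric axioms on short
strings.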



