compressing short strings?

Paul Rubin http
Tue May 20 04:15:03 EDT 2008


I have a lot of short English strings I'd like to compress in order to
reduce the size of a database.  That is, I'd like a compression
function that takes a string like (for example) "George Washington"
and returns a shorter string, with luck maybe 6 bytes or so.  One
obvious idea is take the gzip function, compress some large text
corpus with it in streaming mode and throw away the output (but
setting up the internal state to model the statistics of English
text), then put in "George Washington" and treat the additional output
as the compressed string.  Obviously to get reasonable speed there
would have to be a way to save the internal state after initializing
from the corpus.

Anyone know if this has been done and if there's code around for it?
Maybe I'm better off with freezing a dynamic Markov model?  I think
there's DMM code around but am not sure where to look.

Thanks.



More information about the Python-list mailing list