compressing short strings?

Tue May 20 09:53:21 EDT 2008

On May 20, 8:24 am, bearophileH... at lycos.com wrote:
> bearophile:
>
> > So you need to store only this 11 byte long string to be able to
> > decompress it.
>
> Note that maybe there is a header, that may contain changing things,
> like the length of the compressed text, etc.
>
> Bye,
> bearophile

I've read that military texts have different letter frequencies
than standard English.  Typing on a non-QWERTY keyboard layout may
shift your frequency distribution as well.

I worry that this is an impolite question, so I'll lean on my peers
for backup:

Will you be adding further entries to the zipped list?

Will you be rewriting the entire file upon update, or just appending
bytes?

If your sequence is 'ab, ab, ab, cd, ab', you might be at:

00010.

Add 'cd' again and you're at:

000101.

You didn't have to re-output the contents.

But, if you add 'bc', you have:

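0001012, which isn't binary any more.  One bit per entry can't name a
third distinct entry, so everything has to be re-encoded at a wider
code, and you're left at: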

000 000 000 001 000 001 010

But three bits only cover eight distinct entries; five more new ones
and the codes are used up.
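
Here is a rough sketch of the fixed-width coding in the example above,
in Python.  The names min_width and encode are made up for
illustration, and numbering the entries in first-seen order is my
assumption about how the codes get assigned; it is not taken from any
real compressor.

```python
def min_width(n_symbols):
    """Fewest bits that give each of n_symbols its own fixed-width code."""
    return max(1, (n_symbols - 1).bit_length())


def encode(entries, width=None):
    """Return (table, codes) for a list of entries.

    table maps each distinct entry to a number in first-seen order
    (an assumption for this sketch); codes is one bit string per entry.
    """
    table = {}
    for entry in entries:
        table.setdefault(entry, len(table))
    needed = min_width(len(table))
    if width is None:
        width = needed
    elif width < needed:
        raise ValueError("%d bits cannot hold %d distinct entries"
                         % (width, len(table)))
    codes = [format(table[e], "0%db" % width) for e in entries]
    return table, codes


seq = ["ab", "ab", "ab", "cd", "ab"]
print("".join(encode(seq)[1]))                               # 00010
print("".join(encode(seq + ["cd"])[1]))                      # 000101
print(" ".join(encode(seq + ["cd", "bc"], width=3)[1]))
```

As long as the new entry is already in the table, the old bits stay
valid and you only append.  The moment the table outgrows the width,
every earlier code has to be rewritten.  Note that two bits would
already be enough for three distinct entries; the three-bit width is
passed in here only to match the figures above.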

I'd say pickle the corpus, with new additions, and re-zip the entire
contents each time.  Would you like to break it across
(coughdisksectorscough) multiple files, say, a different
corpus-compression file pair for every thousand entries?
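
A minimal sketch of the pickle-and-re-zip idea, assuming the corpus is
a plain list of strings and using the standard pickle and zlib
modules.  The helper names pack, unpack and add_entry are
hypothetical, and writing the resulting bytes to a file is left out.

```python
import pickle
import zlib


def pack(entries):
    """Pickle the whole corpus and compress it in one go."""
    return zlib.compress(pickle.dumps(list(entries)))


def unpack(blob):
    """Inverse of pack: decompress, then unpickle."""
    return pickle.loads(zlib.decompress(blob))


def add_entry(blob, entry):
    """Add one entry by unpacking, appending and re-packing everything."""
    entries = unpack(blob)
    entries.append(entry)
    return pack(entries)


blob = pack(["ab", "ab", "ab", "cd", "ab"])
blob = add_entry(blob, "bc")
print(unpack(blob))
```

Every update rewrites the whole blob, which is the price of not
appending.  Splitting the corpus into one blob per thousand entries
would cap how much gets rewritten each time.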