compressing short strings?

bearophileHUGS at lycos.com
Tue May 20 09:05:04 EDT 2008


Helmut Jarausch:
> I'd ask in comp.compression where the specialists are listening and who are
> very helpful.

Asking in comp.compression is a good starting point.

My suggestions (sorry if they look a bit unsorted): it depends on what
language you want to use, how much you want to compress the strings,
whether they are ASCII or Unicode, how quickly you need to compress/
decompress them, and how much time you have to write the encoder/
decoder.

You can use Python or a C/D extension; if you use C you will need more
time to develop it, but you will be much freer to use any algorithm
you want (well, you are free with Python too, but the code may be
slower if it performs low-level things), so D may be an acceptable
compromise.

Generally it's better to start with the simplest solution, and move on
to more complex ones only if the first isn't enough. In Python the
simplest thing to try is to compress the strings with the "zip" codec:
"George Washington".encode("zip")
Probably the result is longer than the original (25 bytes of output
for this 17-char string). With "bz2" the results are probably worse
(56 bytes for this string).
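For reference, in Python 2 "zip" is just an alias for the zlib codec,
so you can reproduce the same numbers with the zlib module directly;
on such a short input the fixed stream overhead (2-byte header, 4-byte
Adler-32 checksum, plus the deflate block framing) outweighs any
savings:

import zlib

s = "George Washington"
z = zlib.compress(s)    # the same bytes the "zip" codec produces
print len(s), len(z)    # 17 25: overhead beats the savings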

A possible alternative is to compress many (10-30) strings at a time,
using an external table of their starting points (you can put the
table at the beginning), or separating the strings with a separator
char that never appears in them, and then zipping that group (sketched
below). If you have 1 million strings you can divide them into such
little groups, but the indexing in the DB becomes complex, and I don't
like this solution much.
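A rough sketch of the separator variant (assuming "\0" never occurs in
the strings; the compressed blob would be the DB value, and the index
inside the group part of the key):

import zlib

SEP = "\0"   # assumed to never appear in the strings

def pack_group(strings):
    # compress a small group of strings as one zlib blob
    return zlib.compress(SEP.join(strings))

def unpack_one(blob, i):
    # decompress the whole group to get one string back
    return zlib.decompress(blob).split(SEP)[i]

names = ["George Washington", "John Adams", "Thomas Jefferson"]
blob = pack_group(names)
print unpack_one(blob, 2)   # Thomas Jefferson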

A slightly less simple Python solution is to keep a fixed corpus of
text and compress it both with and without the little string you want
to add (as you say), compressing/decompressing everything each time
(if the corpus is small enough this isn't that slow), for example:

>>> s2 = a 2488 bytes long text
>>> len(s2)
2488
>>> z1 = (s2).encode("zip")
>>> len(z1)
1321
>>> s = 'George Washington'
>>> len((s).encode("zip"))
25
>>> len(s.encode("bz2"))
56
>>> z2 = (s2+s).encode("zip")
>>> len(z2)
1332
>>> len(z2) - len(z1)
11

So you need to store only this 11-byte string to be able to decompress
it. A bigger corpus (s2) gives you a bit more compression, but it
requires more time for compression/decompression:

>>> len(s3) # s3 is a longer English text
10811
>>> z3 = s3.encode("zip")
>>> len(z3)
4971
>>> z4 = (s3+s).encode("zip")
>>> len(z4)
4980
>>> len(z4) - len(z3)
9
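Note that storing only the tail requires the two compressed streams to
share their first len(z1) bytes, which .encode("zip") doesn't strictly
guarantee; a safer variant is to flush the compressor right after the
corpus, keep that flushed prefix once (it's shared by all strings),
and store per string only the bytes emitted afterwards. A sketch with
the standard zlib module (Z_SYNC_FLUSH and Compress.copy()); the
one-line corpus here is only a stand-in for a few KB of real English
text:

import zlib

corpus = "Here you would put a few KB of typical English text. " * 50

co = zlib.compressobj()
prefix = co.compress(corpus) + co.flush(zlib.Z_SYNC_FLUSH)  # stored once

def compress_tail(s):
    c = co.copy()                      # compressor already primed with the corpus
    return c.compress(s) + c.flush()   # these few bytes are all you store per string

def decompress_tail(tail):
    d = zlib.decompressobj()
    d.decompress(prefix)               # replay the shared context
    return d.decompress(tail)

tail = compress_tail("George Washington")
print len(tail), decompress_tail(tail)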

One of the simplest ways to compress short ASCII English texts is to
exploit the fact that very few chars account for most of the text:
encode the 15 most common English chars with 1 nibble each, and every
other char with 3 nibbles (the first nibble is an escape, the two
following ones are the whole byte). In a low-level language this takes
only a few lines of code, it's very fast, and it may shave off 25-33%
of your bytes, so your "George Washington" (17 chars long) may become
about 12-13 chars long. I have implemented this in Python too; it's
easy (a sketch follows).
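A minimal Python sketch of that nibble code (the frequency table here
is a guess, not measured; nibble 0 is the escape, and a lone trailing
0 nibble marks padding):

COMMON = " etaoinshrdlcum"   # the 15 assumed most common chars
CODE = dict((c, i + 1) for i, c in enumerate(COMMON))

def nibble_encode(s):
    nibbles = []
    for ch in s:
        if ch in CODE:
            nibbles.append(CODE[ch])              # 1 nibble for a common char
        else:
            b = ord(ch)
            nibbles.extend([0, b >> 4, b & 0xF])  # escape + whole byte
    if len(nibbles) % 2:
        nibbles.append(0)                         # pad to a whole byte
    return "".join(chr((nibbles[i] << 4) | nibbles[i + 1])
                   for i in xrange(0, len(nibbles), 2))

def nibble_decode(data):
    nibbles = []
    for ch in data:
        nibbles.extend([ord(ch) >> 4, ord(ch) & 0xF])
    out, i = [], 0
    while i < len(nibbles):
        if nibbles[i]:                            # common char
            out.append(COMMON[nibbles[i] - 1])
            i += 1
        elif i + 2 < len(nibbles):                # escaped whole byte
            out.append(chr((nibbles[i + 1] << 4) | nibbles[i + 2]))
            i += 3
        else:                                     # lone escape = padding
            break
    return "".join(out)

z = nibble_encode("George Washington")
print len(z), nibble_decode(z)   # 13 George Washington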

Another possible solution is to use Huffman compression with fixed
symbol frequencies. In Python there are ways to write such a
compressor/decompressor in just 10-15 lines of code, and you can use
heapq to speed up the processing. This may give more compression, but
I presume not much below 4-5 bpc. And if you have to compress/
decompress a LOT of strings you may need to write it (or look for the
code) in C/D.

This is one implementation that doesn't use heapq yet:

from operator import itemgetter

def clean_tree(obj):
    # Strip the frequencies, leaving a nested [left, right] tree of symbols.
    if isinstance(obj, tuple):
        return obj[0]
    else:
        return [clean_tree(obj[0][0]), clean_tree(obj[0][1])]

def huffman_generate(freqsl):
    # Repeatedly merge the two least frequent nodes into one.
    while len(freqsl) > 2:
        freqsl.sort(key=itemgetter(1))
        freqsl = [[freqsl[0:2], freqsl[0][1] + freqsl[1][1]]] + freqsl[2:]
    return [freqsl, -1]

def tree2dict(tree):
    # Walk the tree, assigning "0" to left branches and "1" to right ones.
    def encode(tree, out=""):
        if isinstance(tree, list):
            encode(tree[0], out + "0")
            encode(tree[1], out + "1")
        else:
            dic[tree] = out   # tree is the symbol itself at a leaf
        return dic
    dic = {}
    return encode(tree)

freqs = [('a', 45), ('b', 13), ('c', 12), ('d', 16), ('e', 9), ('f', 5)]
print tree2dict(clean_tree(huffman_generate(freqs)))
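With the resulting dictionary, encoding a string is just one table
lookup per symbol (decoding walks the same tree bit by bit), for
example:

codes = tree2dict(clean_tree(huffman_generate(list(freqs))))
print "".join(codes[c] for c in "fade")   # concatenated variable-length codes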

If you use Huffman coding you can apply the usual tricks to increase
compression: use 2-char symbols, or even whole words as symbols (but
whole words are safe only if you are sure your language is English
with few unusual words).
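For instance, feeding word frequencies (here from a tiny made-up
sample) to the same functions above gives one code per word:

text = "the cat sat on the mat and the cat ran"
counts = {}
for w in text.split():
    counts[w] = counts.get(w, 0) + 1
print tree2dict(clean_tree(huffman_generate(counts.items())))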

If you want more compression, like a 6-byte (2.8 bpc) output, you will
need more complex algorithms. On long English texts the best
compressors need less than 0.9 bpc, but they are surely useless for
you. I don't think you will be able to go down to 6 bytes for that
string, unless you are ready to pay a lot of computational time :-)

Bye,
bearophile


