compressing short strings?
Arnaud Delobelle
arnodel at googlemail.com
Tue May 20 05:19:20 EDT 2008
Paul Rubin <http://phr.cx@NOSPAM.invalid> writes:
> I have a lot of short English strings I'd like to compress in order to
> reduce the size of a database. That is, I'd like a compression
> function that takes a string like (for example) "George Washington"
> and returns a shorter string, with luck maybe 6 bytes or so. One
> obvious idea is take the gzip function, compress some large text
> corpus with it in streaming mode and throw away the output (but
> setting up the internal state to model the statistics of English
> text), then put in "George Washington" and treat the additional output
> as the compressed string. Obviously to get reasonable speed there
> would have to be a way to save the internal state after initializing
> from the corpus.
>
> Anyone know if this has been done and if there's code around for it?
> Maybe I'm better off with freezing a dynamic Markov model? I think
> there's DMM code around but am not sure where to look.
>
> Thanks.
Out of the blue idea which is probably rubbish:
Create a tree of your words, look at it as a Deterministic Finite
Automaton (DFA), and minimise it; you get what some call a DAWG
(Directed Acyclic Word Graph). This works very well with lots of
small words.
e.g.
Words aabc, babc, bbbc.
Minimal DFA:
1 -a-> 2 -a-> 4 -b-> 5 -c-> 6
 \            ^
  b-> 3 -a----+
       \      |
        b-----+
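As a concrete sketch, the minimal DFA above can be written down as plain dicts (the state numbers match the diagram; the representation and the `accepts` helper are my own illustration, not part of the original idea):

```python
# Each state maps letters to successor states; an empty dict marks the
# final state.  States are numbered as in the diagram above.
DFA = {
    1: {'a': 2, 'b': 3},
    2: {'a': 4},
    3: {'a': 4, 'b': 4},
    4: {'b': 5},
    5: {'c': 6},
    6: {},  # final state
}

def accepts(dfa, word, start=1):
    """Return True iff the DFA accepts `word`."""
    state = start
    for letter in word:
        if letter not in dfa[state]:
            return False
        state = dfa[state][letter]
    return not dfa[state]  # accepted only if we stopped in the final state
```

With this, exactly the three words aabc, babc and bbbc are accepted, and near-misses like abbc are rejected.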
To compress a word, simply find its position in alphabetical order,
call that its 'path'.
The 'decompression algorithm' is then (in pseudo-python):

def decompress(DFA, path):
    letters = []
    state = DFA.initial_state
    while not state.is_final():
        transitions = state.transitions
        path, choice = divmod(path, len(transitions))
        state = transitions[choice].new_state
        letters.append(transitions[choice].letter)
    return ''.join(letters)
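To see the round trip work, here is a runnable sketch on the example DFA, again using plain dicts (my representation, not the poster's code); I take transitions in sorted-letter order so that encoding and decoding agree, and `compress` is simply the exact inverse of the divmod loop:

```python
# Example DFA for the words aabc, babc, bbbc; empty dict = final state.
DFA = {
    1: {'a': 2, 'b': 3},
    2: {'a': 4},
    3: {'a': 4, 'b': 4},
    4: {'b': 5},
    5: {'c': 6},
    6: {},
}

def decompress(dfa, path, start=1):
    letters = []
    state = start
    while dfa[state]:                      # stop at the final state
        transitions = sorted(dfa[state])   # letters in alphabetical order
        path, choice = divmod(path, len(transitions))
        letters.append(transitions[choice])
        state = dfa[state][transitions[choice]]
    return ''.join(letters)

def compress(dfa, word, start=1):
    # Inverse of decompress: record each divmod choice along the word's
    # path, then fold the choices back up in reverse order.
    choices = []
    state = start
    for letter in word:
        transitions = sorted(dfa[state])
        choices.append((len(transitions), transitions.index(letter)))
        state = dfa[state][letter]
    path = 0
    for n, choice in reversed(choices):
        path = path * n + choice
    return path
```

One caveat: the path this produces round-trips correctly, but it is not always the word's dense alphabetical rank when states have different fan-out; a true rank encoding would weight each transition by the number of words reachable through it.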
I'm not too sure about this :)
--
Arnaud
More information about the Python-list mailing list