technique to enter text using a mobile phone keypad (T9 dictionary-based disambiguation)

bearophileHUGS at lycos.com bearophileHUGS at lycos.com
Thu Aug 10 07:32:24 EDT 2006


Yu-Xi Lim:

Thank you for your comments, and sorry for my last cryptic answer.

>I think Bearophile isn't refering to compression of the dictionary, but the predictive algorithms used by modern data compressors. However, I think he's over-complicating the issue. It is *not* a data compression problem, imho.<

I agree that my solution is probably too much complex for most purposes
(using a compressor simpler than PAQ is probably better for most of
such purposes), but I think it is a data compression problem, because
compressing data essentially means predicting the next bit, and this
program has to predict what's the most probable letter that the user
wanted to add. See Dasher too at the end of this post.


>While predictive input is desired, the PAQ algorithm utilizes multiple "contexts" (the novel contribution of the paper mentioned below). This is intended for general purpose data compressors which work on a variety of data, such as uncompressed graphics and audio, text, or other binary data. There is however, only one context in this case.<

PAQ8 manages many contexts. Some of them are fit for digital
audio/images, or Jpeg, etc. Such contexts can be removed (disabled)
from the PAQ source code, it's not difficult. But PAQ8 contains many
(more than one) contexts just for textual data, and you can keep such
contexts, because they improve text compression, so they improve the
prediction. (Removing those contexts improves speed and even more it
reduces memory used). For this program I think you can keep the
following ones: Order n, Sparse, Text, Formatted text, Fixed record
length, Context gap, Indirect. If you are short on memory you can
probably remove some of them. If you use the keyboard to input specific
kinds of data, you may add a context for them too.


>A more advanced system (beyond regular T9 and comparable to Motorola's iTap) may consider the context of the word. So typing followed 2255#466 would make "call home" the most likely word.<

A good compressor (a PPM can be okay too) can do this too, its contexts
can be many chars long (but you need lot of memory, probably too much
for a telephone of today).


>https://www.cs.fit.edu/Projects/tech_reports/cs-2005-16.pdf

Note that this document doesn't explain the new versions, that contain
a new good idea. Code for the last version:
http://cs.fit.edu/~mmahoney/compression/paq8h.zip

You can be interested in a similar project, that uses a PPM:
http://www.inference.phy.cam.ac.uk/djw30/dasher/

Using an instrumented PAQ this Dasher can be improved a little (speed
of the algorithm isn't important, because you are compressing few
bits/minute. The dictionaries created by the PAQ can be even frozen, in
some cases, so they can be read from disk/flash at the start of the
program.

Bye,
strong bear hugs,
bearophile




More information about the Python-list mailing list