An attempt at guessing the encoding of a (non-unicode) string

Robert Brewer fumanchu at amor.org
Fri Apr 2 11:20:40 EST 2004


Christos "TZOTZIOY" Georgiou wrote:
> 2. for every 8-bit encoding, some "representative" text is given (the
> longer, the better)
> 
> 2a. the following function is a quick generator of all two-char
> sequences from its string argument.  can be used both for the
> production of the pre-calculated data and for the analysis of a
> given string in the 'wild_guess' function.
> 
> import itertools
> 
> def str_window(text):
>     return itertools.imap(
>         text.__getslice__, xrange(0, len(text)-1), xrange(2, len(text)+1)
>     )
> 
> So for every encoding and 'representative' text, a bag of two-char
> sequences and their frequencies is calculated.
> {frequencies[encoding] = dict(key: two-chars, value: count)}
> 
> 2b. do a lengthy comparison of the bags in order to find the most
> common two-char sequences that, as a set, can be considered unique
> for the specific encoding.
> 
> 2c. For every encoding, keep only a set of the (chosen in step 2b)
> two-char sequences that were judged as 'representative'.  Store these
> calculated sets plus those from step 1a as python code in a helper
> module to be imported from codecs.py for the wild_guess function
> (reproduce the helper module every time some 'representative' text is
> added or modified).

My only thought is that this sounds awfully like a Bayesian filter so
far... Maybe someone can tweak the SpamAssassin code? ;)


FuManChu




More information about the Python-list mailing list