An attempt at guessing the encoding of a (non-unicode) string
Robert Brewer
fumanchu at amor.org
Fri Apr 2 11:20:40 EST 2004
Christos "TZOTZIOY" Georgiou wrote:
> 2. for every 8-bit encoding, some "representative" text is given (the
> longer, the better)
>
> 2a. The following function is a quick generator of all two-char
> sequences from its string argument. It can be used both for the
> production of the pre-calculated data and for the analysis of a
> given string in the 'wild_guess' function.
>
>     def str_window(text):
>         return itertools.imap(
>             text.__getslice__, xrange(0, len(text)-1), xrange(2, len(text)+1)
>         )
>
> So for every encoding and 'representative' text, a bag of two-char
> sequences and their frequencies is calculated:
> {frequencies[encoding] = dict(key: two-chars, value: count)}
>
> 2b. do a lengthy comparison of the bags in order to find the most
> common two-char sequences that, as a set, can be considered unique
> for the specific encoding.
>
> 2c. For every encoding, keep only a set of the (chosen in step 2b)
> two-char sequences that were judged as 'representative'. Store these
> calculated sets plus those from step 1a as python code in a helper
> module to be imported from codecs.py for the wild_guess function
> (reproduce the helper module every time some 'representative' text is
> added or modified).
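The quoted steps 2a-2c might be sketched in modern Python roughly as follows. This is only an illustrative reconstruction of the proposal, not Christos's actual code; the sample texts, the `top=50` cutoff, and the function names `signature` and `wild_guess` (beyond the name used in the post) are assumptions for the sake of the example.

```python
# Sketch of the bigram-signature idea: per encoding, keep a set of the
# most frequent two-char sequences from a representative sample, then
# guess the encoding whose set overlaps the input's bigrams the most.
from collections import Counter

def str_window(text):
    """Yield all two-char windows of text (step 2a)."""
    return (text[i:i + 2] for i in range(len(text) - 1))

def signature(sample, top=50):
    """Keep the most frequent bigrams as a representative set (2b-2c)."""
    counts = Counter(str_window(sample))
    return {pair for pair, _ in counts.most_common(top)}

def wild_guess(data, signatures):
    """Pick the encoding whose signature overlaps the data most."""
    seen = set(str_window(data))
    return max(signatures, key=lambda enc: len(signatures[enc] & seen))
```

With real per-encoding sample corpora, the `signature` sets would be precomputed once and stored in the helper module the post describes.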
My only thought is that this sounds awfully like a Bayesian filter so
far... Maybe someone can tweak the SpamAssassin code? ;)
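To make the Bayesian-filter analogy concrete, one minimal sketch (my own illustration, not SpamAssassin's implementation) would score each encoding by the summed log-probability of the observed bigrams under a smoothed per-encoding bigram model; the training samples and the 256*256 smoothing space here are assumptions.

```python
# Naive-Bayes-style scoring over bigram counts: each encoding gets a
# smoothed frequency model, and the guess is the encoding that assigns
# the highest total log-probability to the input's bigrams.
import math
from collections import Counter

def bigrams(text):
    return [text[i:i + 2] for i in range(len(text) - 1)]

def train(samples):
    """samples: {encoding: representative text} -> smoothed models."""
    models = {}
    for enc, sample in samples.items():
        counts = Counter(bigrams(sample))
        # add-one smoothing over a nominal 256*256 bigram space
        models[enc] = {"counts": counts,
                       "denom": sum(counts.values()) + 256 * 256}
    return models

def bayes_guess(data, models):
    def score(enc):
        m = models[enc]
        return sum(math.log((m["counts"].get(bg, 0) + 1) / m["denom"])
                   for bg in bigrams(data))
    return max(models, key=score)
```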
FuManChu