An attempt at guessing the encoding of a (non-unicode) string

Christos TZOTZIOY Georgiou tzot at sil-tec.gr
Fri Apr 2 09:24:06 EST 2004


This is a subject that comes up fairly often.  Last night, I had the
following idea, for which I would like feedback from you.

This could be implemented as a function in codecs.py (let's call it
"wild_guess"), that is based on some pre-calculated data.  These
pre-calculated data would be produced as follows:

1. Create a dictionary (key: encoding, value: set of valid bytes for the
encoding)

1a. the sets can be constructed by trial and error:

def valid_bytes(encoding):
    """Return the set of single chars (bytes 0-255) that decode
    successfully in the given encoding."""
    result= set()
    for byte in xrange(256):
        char= chr(byte)
        try:
            char.decode(encoding)
        except UnicodeDecodeError:
            pass
        else:
            result.add(char)
    return result
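
For instance, the step-1 dictionary could then be built like this;
SINGLE_BYTE_ENCODINGS and valid_sets are names of my own choosing, and
the list is only an illustrative sample:

# an illustrative sample; a real table would cover all 8-bit codecs
SINGLE_BYTE_ENCODINGS = ['latin-1', 'iso8859-7', 'koi8-r', 'cp1253']

valid_sets = {}
for encoding in SINGLE_BYTE_ENCODINGS:
    valid_sets[encoding] = valid_bytes(encoding)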

2. for every 8-bit encoding, some "representative" text is given (the
longer, the better)

2a. the following function is a quick generator of all two-char
sequences from its string argument.  It can be used both for the
production of the pre-calculated data and for the analysis of a given
string in the 'wild_guess' function.

import itertools

def str_window(text):
    """Yield every overlapping two-char slice of text."""
    return itertools.imap(
        text.__getslice__, xrange(0, len(text)-1), xrange(2, len(text)+1)
    )
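
For example, list(str_window('abcd')) gives ['ab', 'bc', 'cd'].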

So for every encoding and its 'representative' text, a bag of two-char
sequences and their frequencies is calculated: frequencies[encoding] =
dict(key: two-char sequence, value: count).
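
As a sketch, assuming the 'representative' samples live in a dict
sample_texts (a hypothetical name) mapping encoding name to its text:

def bigram_frequencies(text):
    """Count every two-char sequence occurring in text."""
    counts = {}
    for pair in str_window(text):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

frequencies = {}
for encoding, text in sample_texts.items():
    frequencies[encoding] = bigram_frequencies(text)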

2b. do a lengthy comparison of the bags in order to find the most common
two-char sequences that, as a set, can be considered unique for the
specific encoding.
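
To make 2b concrete, here is one naive reading of that comparison: for
each encoding, keep its most frequent bigrams that never occur in any
other encoding's sample.  The cutoff of 50 is an arbitrary assumption,
and a real version should probably weigh relative frequencies rather
than demand strict absence:

def distinctive_bigrams(frequencies, keep=50):
    """For every encoding, keep the most frequent bigrams that
    appear in no other encoding's sample text."""
    result = {}
    for encoding, counts in frequencies.items():
        seen_elsewhere = set()
        for other, other_counts in frequencies.items():
            if other != encoding:
                seen_elsewhere.update(other_counts)
        unique = [pair for pair in counts if pair not in seen_elsewhere]
        unique.sort(key=counts.get, reverse=True)
        result[encoding] = set(unique[:keep])
    return result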

2c. For every encoding, keep only a set of the (chosen in step 2b)
two-char sequences that were judged as 'representative'.  Store these
calculated sets plus those from step 1a as Python code in a helper
module to be imported from codecs.py for the wild_guess function
(regenerate the helper module every time some 'representative' text is
added or modified).
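
The regeneration could be as simple as dumping reprs into a module; the
module name _guessdata and the function below are my own invention:

def write_helper_module(valid_sets, distinctive_sets, path='_guessdata.py'):
    """Dump the pre-calculated sets as importable Python source."""
    out = open(path, 'w')
    try:
        out.write('# auto-generated; do not edit by hand\n')
        out.write('valid_sets = %r\n' % valid_sets)
        out.write('distinctive_sets = %r\n' % distinctive_sets)
    finally:
        out.close()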

3. write the wild_guess function

3a.  the function 'wild_guess' would first construct a set from its
argument:

sample_set= set(argument)

and by set operations against the sets from step 1a, we can exclude
codecs where the sample set is not a subset of the encoding's valid set.
I don't expect that this step would exclude many encodings, but I think
it should not be skipped.

3b. pass the argument through the str_window function, and construct a
set of all two-char sequences

3c. from all sets from step 2c, find the one whose intersection with the
set from 3b is largest as a ratio len(intersection)/len(encoding_set),
and suggest the corresponding encoding.
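
Putting 3a-3c together, wild_guess might look roughly like this; it
imports the hypothetical _guessdata module from 2c, and a real version
would need tie-breaking and some fallback when nothing scores:

from _guessdata import valid_sets, distinctive_sets

def wild_guess(argument):
    """Suggest an encoding for the byte string argument,
    or None if no candidate matches at all."""
    # 3a: drop encodings whose valid-byte set does not cover the sample
    sample_set = set(argument)
    candidates = [encoding for encoding in distinctive_sets
                  if sample_set <= valid_sets[encoding]]
    # 3b: all two-char sequences of the argument
    bigrams = set(str_window(argument))
    # 3c: score each candidate by its overlap ratio
    best, best_ratio = None, 0.0
    for encoding in candidates:
        encoding_set = distinctive_sets[encoding]
        if encoding_set:
            ratio = len(bigrams & encoding_set) / float(len(encoding_set))
            if ratio > best_ratio:
                best, best_ratio = encoding, ratio
    return best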

What do you think?  I can't test whether that would work unless I have
'representative' texts for various encodings.  Please feel free to help
or bash :)

PS I know how generic 'representative' is, and how hard it is to qualify
some text as such, therefore the quotes.  That is why I said 'the
longer, the better'.
-- 
TZOTZIOY, I speak England very best,
Ils sont fous ces Redmontains! --Harddix


