An attempt at guessing the encoding of a (non-unicode) string

Christos TZOTZIOY Georgiou tzot at sil-tec.gr
Mon Apr 5 05:55:25 EDT 2004


On Fri, 02 Apr 2004 14:49:07 -0800, rumours say that David Eppstein
<eppstein at ics.uci.edu> might have written:

>I've been getting decent results by a much simpler approach:
>count the number of characters for which the encoding produces a symbol 
>c for which c.isalpha() or c.isspace(), subtract a large penalty if 
>using the encoding leads to UnicodeDecodeError, and take the encoding 
>with the largest count.
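
A minimal sketch of the heuristic quoted above, assuming Python 3 and
a byte string as input. The names score_encoding and guess_encoding,
the candidate list, and the penalty value are illustrative assumptions,
not David's actual code.

```python
def score_encoding(data: bytes, encoding: str, penalty: int = 1_000_000) -> int:
    """Count decoded characters that are letters or whitespace.

    If strict decoding fails, decode leniently and subtract a large
    penalty (the penalty size is an arbitrary assumption).
    """
    try:
        text = data.decode(encoding)
        failed = False
    except UnicodeDecodeError:
        text = data.decode(encoding, errors="replace")
        failed = True
    count = sum(1 for c in text if c.isalpha() or c.isspace())
    return count - penalty if failed else count


def guess_encoding(data: bytes, candidates: list[str]) -> str:
    """Return the candidate encoding with the largest score."""
    return max(candidates, key=lambda enc: score_encoding(data, enc))


if __name__ == "__main__":
    sample = "Καλημέρα κόσμε".encode("iso8859_7")
    print(guess_encoding(sample, ["utf-8", "iso8859_7"]))
```

Note that two single-byte encodings that both map the same bytes to
letters (for example iso8859_7 and latin-1 on Greek text) can tie under
this score, in which case max() simply returns the first candidate.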

Somebody (by email only so far) has suggested that spambayes could be
used for the task... perhaps they're right; however, this is not as
simple and independent a solution as I would like to deliver.

I believe that your idea of a score is a good one; I feel that the
key should be two-char combinations, but I'll have to compare the
success rates of one-char and two-char keys.
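
To make the two-char idea concrete, here is a rough sketch, assuming
Python 3 and one "representative" text per candidate encoding. The
helper names (bigrams, train, bigram_score, guess_by_bigrams) and the
averaging scheme are my own assumptions for illustration; nothing here
has been measured against the one-char score yet.

```python
from collections import Counter


def bigrams(text: str) -> list[str]:
    """Return all overlapping two-character combinations of text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def train(sample_text: str) -> Counter:
    """Build a bigram frequency table from a representative text."""
    return Counter(bigrams(sample_text.lower()))


def bigram_score(data: bytes, encoding: str, model: Counter) -> float:
    """Average frequency, in the model, of the bigrams of the decoded data."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return float("-inf")  # undecodable: rank below every other candidate
    pairs = bigrams(text.lower())
    if not pairs:
        return 0.0
    # A Counter returns 0 for bigrams it has never seen.
    return sum(model[p] for p in pairs) / len(pairs)


def guess_by_bigrams(data: bytes, models: dict[str, Counter]) -> str:
    """Return the encoding whose model gives the decoded data the best score."""
    return max(models, key=lambda enc: bigram_score(data, enc, models[enc]))
```

Unlike the one-char count, this can separate two single-byte encodings
that both decode the bytes to letters, provided the representative
texts are large enough to be meaningful.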

I'll try to search for "representative" texts on the web for as many
encodings as I can; any pointers or links from non-English speakers
would be welcome in the thread.
-- 
TZOTZIOY, I speak England very best,
Ils sont fous ces Redmontains! --Harddix


