An attempt at guessing the encoding of a (non-unicode) string
David Eppstein
eppstein at ics.uci.edu
Mon Apr 5 16:37:34 EDT 2004
In article <6pa270h031thgleo4a31itktb95n9e4rvm at 4ax.com>,
Christos "TZOTZIOY" Georgiou <tzot at sil-tec.gr> wrote:
> >I've been getting decent results by a much simpler approach:
> >count the number of characters for which the encoding produces a symbol
> >c for which c.isalpha() or c.isspace(), subtract a large penalty if
> >using the encoding leads to UnicodeDecodeError, and take the encoding
> >with the largest count.
>
> Somebody (by email only so far) has suggested that spambayes could be
> used for the task... perhaps they're right, however this is not as
> simple and independent a solution as I would like to deliver.
>
> I would believe that your idea of a score is a good one; I feel that the
> key should be two-char combinations, but I'll have to compare the
> success rate of both one-char and two-char keys.
>
> I'll try to search for "representative" texts on the web for as many
> encodings as I can; any pointers, links from non-english speakers would
> be welcome in the thread.
BTW, if you're going to implement the single-char version, there is a
faster way, at least for encodings that translate one byte -> one
Unicode position (e.g., not UTF-8) and for texts that are large enough:
precompute a table of byte frequencies in the text once, then compute
each encoding's score by summing the frequencies of the alphabetic bytes.
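
A minimal Python 3 sketch of both variants, for illustration only. The
function names, the penalty value, and the choice to count whitespace
bytes along with alphabetic ones (as in the scoring rule quoted above)
are assumptions of this sketch, not tested code from the thread. The
table version is only valid for single-byte encodings.

```python
from collections import Counter
from functools import lru_cache

# Assumed value: the post only asks for "a large penalty".
DECODE_ERROR_PENALTY = 1_000_000


def score_direct(data, encoding):
    """Decode the whole text and count alphabetic or whitespace characters."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return -DECODE_ERROR_PENALTY
    return sum(1 for c in text if c.isalpha() or c.isspace())


@lru_cache(maxsize=None)
def classify_bytes(encoding):
    """For a single-byte encoding, return (good, undefined) byte values.

    'good' bytes decode to an alphabetic or whitespace character;
    'undefined' bytes cannot be decoded at all.
    """
    good, undefined = set(), set()
    for b in range(256):
        try:
            c = bytes([b]).decode(encoding)
        except UnicodeDecodeError:
            undefined.add(b)
            continue
        if c.isalpha() or c.isspace():
            good.add(b)
    return frozenset(good), frozenset(undefined)


def score_from_counts(counts, encoding):
    """Same score as score_direct, computed from a byte-frequency table."""
    good, undefined = classify_bytes(encoding)
    if any(counts[b] for b in undefined):
        return -DECODE_ERROR_PENALTY
    return sum(counts[b] for b in good)


def guess_encoding(data, candidates):
    """Pick the single-byte candidate encoding with the largest score."""
    counts = Counter(data)  # one pass over the text, shared by all candidates
    return max(candidates, key=lambda enc: score_from_counts(counts, enc))
```

With data = "héllo wörld".encode("latin_1"), the call
guess_encoding(data, ["ascii", "latin_1"]) returns "latin_1", because
ascii takes the penalty. Encodings that map the same bytes to letters
(latin_1 and iso8859_7 on Greek text, say) will tie, so this only
illustrates the speedup, not a better classifier.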
--
David Eppstein http://www.ics.uci.edu/~eppstein/
Univ. of California, Irvine, School of Information & Computer Science