An attempt at guessing the encoding of a (non-unicode) string

David Eppstein eppstein at ics.uci.edu
Mon Apr 5 16:37:34 EDT 2004


In article <6pa270h031thgleo4a31itktb95n9e4rvm at 4ax.com>,
 Christos "TZOTZIOY" Georgiou <tzot at sil-tec.gr> wrote:

> >I've been getting decent results by a much simpler approach:
> >count the number of characters for which the encoding produces a symbol 
> >c for which c.isalpha() or c.isspace(), subtract a large penalty if 
> >using the encoding leads to UnicodeDecodeError, and take the encoding 
> >with the largest count.
> 
> Somebody (by email only so far) has suggested that spambayes could be
> used for the task... perhaps they're right, however this is not as simple
> and independent a solution as I would like to deliver.
> 
> I would believe that your idea of a score is a good one; I feel that the
> key should be two-char combinations, but I'll have to compare the
> success rate of both one-char and two-char keys.
> 
> I'll try to search for "representative" texts on the web for as many
> encodings as I can; any pointers or links from non-English speakers
> would be welcome in the thread.

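BTW, if you're going to implement the single-char version, then at least
for encodings that map each byte to a single Unicode code point (so not
UTF-8, for example), and provided your texts are large enough, it will
be faster to precompute a table of byte frequencies in the text once and
then compute each encoding's score by summing the frequencies of the
bytes that decode to alphabetic (or whitespace) characters. Each
candidate encoding then costs at most 256 table lookups instead of
another pass over the whole text.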
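
A rough sketch of both variants, in case it helps make the idea
concrete. The function names, the PENALTY value, and the candidate list
are placeholders of mine, not anything from the earlier messages, and
the table version assumes every candidate is a single-byte encoding:

```python
from collections import Counter

# Assumed value; the heuristic only calls for "a large penalty".
PENALTY = 1_000_000


def score_direct(data: bytes, encoding: str) -> int:
    """Decode the whole text and count alphabetic or whitespace characters."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return -PENALTY
    return sum(1 for c in text if c.isalpha() or c.isspace())


def score_with_table(freq: Counter, encoding: str) -> int:
    """Same score for a single-byte encoding, from precomputed byte counts."""
    total = 0
    for byte, count in freq.items():
        try:
            c = bytes([byte]).decode(encoding)
        except UnicodeDecodeError:
            # A byte that occurs in the text is undefined in this encoding.
            return -PENALTY
        if c.isalpha() or c.isspace():
            total += count
    return total


def guess_encoding(data: bytes, candidates):
    """Return the candidate with the highest score (first one wins ties)."""
    freq = Counter(data)  # iterating over bytes yields ints in 0..255
    return max(candidates, key=lambda enc: score_with_table(freq, enc))
```

For example, guess_encoding("naïve café".encode("latin-1"), ["ascii",
"latin-1"]) picks "latin-1", because the ascii decode fails and takes
the penalty. Note that several single-byte encodings can tie on the same
text (their high bytes may all be letters), which is presumably where
the two-char keys would earn their keep.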

-- 
David Eppstein                      http://www.ics.uci.edu/~eppstein/
Univ. of California, Irvine, School of Information & Computer Science


