An attempt at guessing the encoding of a (non-unicode) string

David Eppstein eppstein at ics.uci.edu
Fri Apr 2 17:49:07 EST 2004


I've been getting decent results with a much simpler approach:
for each candidate encoding, decode the string and count the characters 
c for which c.isalpha() or c.isspace() is true, subtract a large penalty 
if decoding with that encoding raises UnicodeDecodeError, and take the 
encoding with the largest score.
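
A rough sketch of that scoring idea, for illustration only. It is written
for Python 3 bytes objects. The function names, the candidate list, the
penalty value, and the choice to still count characters (with
errors="replace") after a failed decode are all my assumptions, not
details from the approach described above.

```python
def score_encoding(data, name, penalty=1_000_000):
    """Score one candidate encoding for the byte string `data`."""
    try:
        text = data.decode(name)
        failed = False
    except UnicodeDecodeError:
        # Assumption: still count what decodes, then apply the penalty.
        text = data.decode(name, errors="replace")
        failed = True
    # Count characters that look like ordinary text.
    score = sum(1 for c in text if c.isalpha() or c.isspace())
    return score - penalty if failed else score


def guess_encoding(data, candidates=("ascii", "utf-8", "koi8-r", "cp1252", "latin-1")):
    """Return the candidate with the largest score (earliest wins ties)."""
    best_name, best_score = None, None
    for name in candidates:
        score = score_encoding(data, name)
        if best_score is None or score > best_score:
            best_name, best_score = name, score
    return best_name
```

Usage would be something like guess_encoding(raw_bytes). Since ties go to
the earlier candidate, and a codec like latin-1 accepts every byte
sequence, the order of the candidate list probably matters in practice.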

-- 
David Eppstein                      http://www.ics.uci.edu/~eppstein/
Univ. of California, Irvine, School of Information & Computer Science


