An attempt at guessing the encoding of a (non-unicode) string

John Roth newsgroups at jhrothjr.com
Sat Apr 3 09:14:30 EST 2004


"David Eppstein" <eppstein at ics.uci.edu> wrote in message
news:eppstein-8C467F.14490702042004 at news.service.uci.edu...
> I've been getting decent results by a much simpler approach:
> count the number of characters for which the encoding produces a symbol
> c for which c.isalpha() or c.isspace(), subtract a large penalty if
> using the encoding leads to UnicodeDecodeError, and take the encoding
> with the largest count.

Shouldn't that be isalnum()? Or does your data not contain
very many digits?
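
For anyone following along, here is a rough sketch of how I read the
quoted heuristic. It is not David's actual code. The names
score_encoding and guess_encoding, the candidate list, the penalty
value, and the choice to keep counting after a failed decode are all my
own assumptions. The accept parameter lets you swap the isalpha() test
for isalnum().

```python
def score_encoding(data, encoding, accept=None, penalty=10**6):
    """Score how plausible `encoding` is for the byte string `data`."""
    if accept is None:
        # Default test from the quoted heuristic: letters and whitespace.
        accept = lambda c: c.isalpha() or c.isspace()
    score = 0
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        # Assumption: on failure, subtract the large penalty and still
        # count what decodes when bad bytes become replacement characters.
        score -= penalty
        text = data.decode(encoding, errors="replace")
    return score + sum(1 for c in text if accept(c))


def guess_encoding(data, candidates=("ascii", "utf-8", "latin-1"), accept=None):
    """Return the candidate with the highest score (first one wins ties)."""
    return max(candidates, key=lambda name: score_encoding(data, name, accept))


# Variant that also counts digits, as suggested in this reply.
alnum_or_space = lambda c: c.isalnum() or c.isspace()
```

With the default test, digits add nothing to the score, so
score_encoding(b"abc 123", "ascii") counts only the letters and the
space; passing accept=alnum_or_space counts the digits as well. Whether
that changes the winning encoding depends on the data.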

John Roth
>
> -- 
> David Eppstein                      http://www.ics.uci.edu/~eppstein/
> Univ. of California, Irvine, School of Information & Computer Science




