An attempt at guessing the encoding of a (non-unicode) string

David Eppstein eppstein at ics.uci.edu
Sat Apr 3 14:25:08 EST 2004


In article <106thmedmq162ce at news.supernews.com>,
 "John Roth" <newsgroups at jhrothjr.com> wrote:

> "David Eppstein" <eppstein at ics.uci.edu> wrote in message
> news:eppstein-8C467F.14490702042004 at news.service.uci.edu...
> > I've been getting decent results by a much simpler approach:
> > count the number of characters for which the encoding produces a symbol
> > c for which c.isalpha() or c.isspace(), subtract a large penalty if
> > using the encoding leads to UnicodeDecodeError, and take the encoding
> > with the largest count.
> 
> Shouldn't that be isalnum()? Or does your data not have
> very many numbers?

It's only important if your text has many code positions which produce a 
digit in one encoding and not in another, and which are hard to 
disambiguate using isalpha() alone.  I haven't encountered that 
situation.
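For concreteness, the scoring heuristic described above might be sketched 
like this (the candidate list and the size of the penalty are illustrative 
assumptions, not part of the original description):

```python
def guess_encoding(data, candidates=("utf-8", "latin-1", "cp1252")):
    """Guess the encoding of a byte string by scoring each candidate:
    count decoded characters that are alphabetic or whitespace, and
    apply a large penalty when decoding raises UnicodeDecodeError."""
    best_encoding, best_score = None, None
    for encoding in candidates:
        try:
            text = data.decode(encoding)
            # score = number of "plausible text" characters
            score = sum(1 for c in text if c.isalpha() or c.isspace())
        except UnicodeDecodeError:
            # large penalty: an encoding that can't decode the data at
            # all should lose to any encoding that can
            score = -len(data)
        if best_score is None or score > best_score:
            best_encoding, best_score = encoding, score
    return best_encoding
```

Ties are broken by candidate order, so putting the most likely encodings 
first matters.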

-- 
David Eppstein                      http://www.ics.uci.edu/~eppstein/
Univ. of California, Irvine, School of Information & Computer Science
