An attempt at guessing the encoding of a (non-unicode) string

Jon Willeke j.dot.willeke at verizon.dot.net
Fri Apr 2 10:05:42 EST 2004


Christos TZOTZIOY Georgiou wrote:
> This is a subject that comes up fairly often.  Last night, I had the
> following idea, for which I would like feedback from you.
> 
> This could be implemented as a function in codecs.py (let's call it
> "wild_guess"), that is based on some pre-calculated data.  These
> pre-calculated data would be produced as follows:
...
> What do you think?  I can't test whether that would work unless I have
> 'representative' texts for various encodings.  Please feel free to help
> or bash :)

The representative text would, in some circles, be called a training 
corpus.  See the Natural Language Toolkit for some modules that may help 
you prototype this approach:

   <http://nltk.sf.net/>

In particular, check out the probability tutorial.
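
To make the training-corpus idea concrete, here is a minimal sketch in plain Python. It is not the scheme from the original post, whose details are elided above, and it does not use NLTK. It assumes the simplest possible model: count byte frequencies per encoding in the training corpus, then pick the encoding under which the unknown bytes are most probable. The name wild_guess is borrowed from the quoted proposal; train, log_likelihood, and the tiny Greek sample are hypothetical and only there for illustration.

```python
import math
from collections import Counter


def train(samples):
    """Build one byte-frequency model per encoding.

    samples maps an encoding name to a list of byte strings that are
    already encoded in that encoding (the training corpus).
    """
    models = {}
    for encoding, texts in samples.items():
        counts = Counter()
        for data in texts:
            counts.update(data)  # iterating over bytes yields ints 0..255
        models[encoding] = counts
    return models


def log_likelihood(counts, data):
    """Score data under one model, with add-one smoothing over 256 byte values."""
    total = sum(counts.values()) + 256
    return sum(math.log((counts[b] + 1) / total) for b in data)


def wild_guess(models, data):
    """Return the encoding whose model gives data the highest score."""
    return max(models, key=lambda enc: log_likelihood(models[enc], data))


# Hypothetical, far-too-small training corpus: one Greek sentence
# encoded two ways. A real corpus would need much more text.
text_el = "Καλημέρα κόσμε, αυτό είναι ένα δείγμα κειμένου"
corpus = {
    "iso8859_7": [text_el.encode("iso8859_7")],
    "cp737": [text_el.encode("cp737")],
}
models = train(corpus)
guess = wild_guess(models, "ένα κείμενο".encode("cp737"))
```

Single-byte counts are about the crudest model that could work. Byte pairs, or a proper probability distribution of the kind the tutorial covers, would likely separate similar encodings better, at the cost of needing a larger corpus.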


