An attempt at guessing the encoding of a (non-unicode) string
Jon Willeke
j.dot.willeke at verizon.dot.net
Fri Apr 2 10:05:42 EST 2004
Christos TZOTZIOY Georgiou wrote:
> This is a subject that comes up fairly often. Last night, I had the
> following idea, for which I would like feedback from you.
>
> This could be implemented as a function in codecs.py (let's call it
> "wild_guess"), that is based on some pre-calculated data. These
> pre-calculated data would be produced as follows:
...
> What do you think? I can't test whether that would work unless I have
> 'representative' texts for various encodings. Please feel free to help
> or bash :)
The representative text would, in some circles, be called a training
corpus. See the Natural Language Toolkit for some modules that may help
you prototype this approach:
<http://nltk.sf.net/>
In particular, check out the probability tutorial.
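
As a rough illustration of the training-corpus idea, here is a minimal
sketch in Python 3. It is only one way such a guesser might be prototyped,
not the scheme from the original post (whose details are elided above) and
not anything that exists in codecs.py. It builds a smoothed byte-frequency
table per encoding from sample text, then picks the encoding under which a
new byte string is most probable. The name wild_guess is borrowed from the
quoted message; train, the sample text, and the choice of add-one smoothing
are assumptions made for the example. It uses only the standard library
rather than NLTK.

```python
import math
from collections import Counter


def train(samples):
    """Build one byte-level model per encoding (hypothetical helper).

    samples maps an encoding name to a bytes object holding representative
    text in that encoding. Each model is a list of 256 log-probabilities.
    """
    models = {}
    for name, data in samples.items():
        counts = Counter(data)  # iterating bytes yields ints 0..255
        total = len(data)
        # Add-one smoothing so unseen byte values get a small nonzero
        # probability instead of log(0).
        models[name] = [
            math.log((counts.get(b, 0) + 1) / (total + 256))
            for b in range(256)
        ]
    return models


def wild_guess(data, models):
    """Return the encoding name whose model scores data highest."""
    scores = {
        name: sum(logp[b] for b in data)
        for name, logp in models.items()
    }
    return max(scores, key=scores.get)


# Tiny, unrepresentative training corpus, for illustration only.
text = "Καλημέρα κόσμε, αυτό είναι ένα δοκίμιο"
models = train({
    "iso8859_7": text.encode("iso8859_7"),
    "utf-8": text.encode("utf-8"),
})
print(wild_guess("ένα άλλο κείμενο".encode("iso8859_7"), models))
```

A real prototype would need a far larger corpus per encoding, and single-byte
frequencies cannot separate encodings that place the same letters at the same
byte values; byte pairs or language-specific models would be a natural next
step.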