Encoding sniffer?

Diez B. Roggisch deets at nospam.web.de
Thu Jan 5 16:53:16 EST 2006


garabik-news-2005-05 at kassiopeia.juls.savba.sk wrote:
> Diez B. Roggisch <deets at nospam.web.de> wrote:
> 
>>>print try_encodings(text, ['ascii', 'utf-8', 'iso8859_1', 'cp1252', 'macroman'])
>>
>>I've fallen into that trap before - it won't work after the iso8859_1. 
>>The reason is that an eight-bit encoding usually has all 256 code 
>>points assigned (there are exceptions, but you have to be lucky to get 
>>a string containing a value not assigned in one of them - which is 
>>highly unlikely).
>>
>>AFAIK iso-8859-1 has all codepoints taken - so you won't go beyond that 
>>in your example.
> 
> 
> I pasted from the wrong file :-)
> See my previous posting (a few days ago) - what I did was to implement
> an iso8859_1_ncc encoding (iso8859_1 without control codes), and
> the line should have been 
> try_encodings(text, ['ascii', 'utf-8', 'iso8859_1_ncc', 'cp1252', 'macroman'])
> 
> where iso8859_1_ncc.py is the same as iso8859_1.py from the Python
> distribution, with this one line changed:
> 
> decoding_map = codecs.make_identity_dict(range(32, 128)+range(128+32,256))
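
For what it's worth, the same effect can be had without copying the 
module - a minimal sketch that registers such a codec at runtime (the 
codec name follows the posting above; the implementation itself is my 
own guess):

import codecs

# latin-1 minus the C0/C1 control ranges: bytes 0-31 and 128-159
# are left unmapped, so strict decoding raises UnicodeDecodeError.
_decoding_map = codecs.make_identity_dict(
    list(range(32, 128)) + list(range(128 + 32, 256)))
_encoding_map = codecs.make_encoding_map(_decoding_map)

def _decode(input, errors='strict'):
    return codecs.charmap_decode(input, errors, _decoding_map)

def _encode(input, errors='strict'):
    return codecs.charmap_encode(input, errors, _encoding_map)

def _search(name):
    if name == 'iso8859_1_ncc':
        return codecs.CodecInfo(_encode, _decode, name='iso8859_1_ncc')
    return None

codecs.register(_search)

Now b'\x9f'.decode('iso8859_1_ncc') raises UnicodeDecodeError where 
plain latin-1 would happily succeed - which is what makes the fallback 
chain work.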

Ok, I can see that. But still, there would be quite a few overlapping 
codepoints.
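
For reference, try_encodings itself never made it into the thread, so 
this is only a guess at its shape: walk the candidate list and return 
the first decode that succeeds.

def try_encodings(data, encodings):
    # Return (text, encoding) for the first candidate that
    # decodes the byte string without error.
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError('no candidate encoding matched')

First match wins - so with cp1252 and macroman overlapping on most of 
the range 128-255, whichever comes first in the list is returned, 
regardless of which encoding actually produced the bytes.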

I think what the OP (and many other people) want is something that 
guesses the encoding based on probabilities - for example, how likely 
certain trigrams containing an umlaut are in the decoded text.
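
A toy sketch of that idea: decode with every candidate and score the 
results against a handful of language-specific trigrams. The trigram 
list below is invented for illustration - a real sniffer would use 
frequencies trained on a corpus.

# Hypothetical scorer; the trigrams are common German sequences
# ('sch', 'übe', 'ähr', 'önn') picked purely for illustration.
LIKELY_TRIGRAMS = [u'sch', u'\xfcbe', u'\xe4hr', u'\xf6nn']

def guess_encoding(data, candidates):
    best_enc, best_score = None, -1
    for enc in candidates:
        try:
            text = data.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue
        # Crude score: count occurrences of the expected trigrams.
        score = sum(text.count(tri) for tri in LIKELY_TRIGRAMS)
        if score > best_score:
            best_enc, best_score = enc, score
    return best_enc

Decodings that turn the umlauts into the wrong characters simply score 
lower, so the statistically plausible candidate wins.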

There seems to be a tool called "konwert" out there that does such 
things, and recode has some guessing support too, AFAIK - but I haven't 
seen any special Python modules for this so far.

Diez


