Encoding sniffer?

garabik-news-2005-05 at kassiopeia.juls.savba.sk
Thu Jan 5 16:07:43 EST 2006


Diez B. Roggisch <deets at nospam.web.de> wrote:
>> print try_encodings(text, ['ascii', 'utf-8', 'iso8859_1', 'cp1252', 'macroman'])
> 
> I've fallen into that trap before - it won't work after the iso8859_1.
> The reason is that an eight-bit encoding usually has all 256 code points
> assigned (there are exceptions, but you have to be lucky to have
> a string that contains a value not assigned in one of them, which is
> highly unlikely).
> 
> AFAIK iso-8859-1 has all code points taken - so you won't get beyond it
> in your example.

I pasted from the wrong file :-)
See my previous posting (a few days ago) - what I did was implement an
iso8859_1_ncc encoding (iso8859_1 without control codes), and
the line should have been
try_encodings(text, ['ascii', 'utf-8', 'iso8859_1_ncc', 'cp1252', 'macroman'])

where iso8859_1_ncc.py is the same as iso8859_1.py from the Python
distribution, with this one line changed:

decoding_map = codecs.make_identity_dict(range(32, 128)+range(128+32,256))
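
To make the fallback order concrete, here is a minimal sketch of the idea for Python 3. The body of try_encodings is not shown in this thread, so the version below is only a guess at its behaviour, and NCC_MAP and decode_ncc are hypothetical names. Instead of registering a real iso8859_1_ncc codec module, the sketch special-cases that name and decodes through the same identity map as the line above.

```python
import codecs

# Same identity map as in the post: Latin-1 minus 0x00-0x1F and 0x80-0x9F.
# (The list() calls are only needed because range objects cannot be
# concatenated with + in Python 3.)
NCC_MAP = codecs.make_identity_dict(
    list(range(32, 128)) + list(range(128 + 32, 256)))


def decode_ncc(data):
    """Decode bytes as ISO 8859-1, failing on bytes missing from NCC_MAP."""
    text, _consumed = codecs.charmap_decode(data, "strict", NCC_MAP)
    return text


def try_encodings(data, encodings):
    """Return (name, text) for the first encoding that decodes data cleanly.

    Hypothetical reconstruction: returns (None, None) if nothing fits.
    """
    for name in encodings:
        try:
            if name == "iso8859_1_ncc":  # assumed name, not a stdlib codec
                return name, decode_ncc(data)
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    return None, None


ORDER = ["ascii", "utf-8", "iso8859_1_ncc", "cp1252", "macroman"]

print(try_encodings(b"caf\xe9", ORDER)[0])     # a plain Latin-1 letter
print(try_encodings(b"\x93hi\x94", ORDER)[0])  # bytes in the 0x80-0x9F range
```

With plain iso8859_1 in the third slot, every byte string would stop there, which is the trap described in the quoted text. One thing to keep in mind: as written, the map also leaves out tab, newline and carriage return, so Latin-1 text containing line breaks falls through to cp1252.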


-- 
 -----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__    garabik @ kassiopeia.juls.savba.sk     |
 -----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!


