Encoding sniffer?
garabik-news-2005-05 at kassiopeia.juls.savba.sk
Thu Jan 5 16:07:43 EST 2006
Diez B. Roggisch <deets at nospam.web.de> wrote:
>> print try_encodings(text, ['ascii', 'utf-8', 'iso8859_1', 'cp1252', 'macroman'])
>
> I've fallen into that trap before - it won't work after the iso8859_1.
> The reason is that an eight-bit encoding has all 256 code-points
> assigned (usually, there are exceptions but you have to be lucky to have
> a string that contains a value not assigned in one of them - which is
> highly unlikely)
>
> AFAIK iso-8859-1 has all codepoints taken - so you won't go beyond that
> in your example.
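The point quoted above is easy to check. What follows is only a minimal sketch in Python 3 syntax: the body of try_encodings is an assumption about what such a helper looks like (its real source is not in this message), and the sample bytes are made up for illustration.

```python
def try_encodings(data, encodings):
    """Return (encoding, text) for the first encoding that decodes data, else None.

    Hypothetical helper: first strict decode that succeeds wins.
    """
    for enc in encodings:
        try:
            return enc, data.decode(enc)
        except UnicodeDecodeError:
            continue
    return None


# iso8859_1 assigns a code point to every byte value, so decoding never fails.
all_bytes = bytes(range(256))
assert len(all_bytes.decode('iso8859_1')) == 256

# Bytes 0x93/0x94 are curly quotes in cp1252, but iso8859_1 accepts them first
# (as C1 control characters), so cp1252 and macroman are never tried.
sample = b'\x93quoted\x94'
print(try_encodings(sample, ['ascii', 'utf-8', 'iso8859_1', 'cp1252', 'macroman']))
```

With this ordering, anything that is not ASCII or valid UTF-8 gets reported as iso8859_1, whatever it really is.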
I pasted from a wrong file :-)
See my previous posting (a few days ago) - what I did was to implement
an iso8859_1_ncc encoding (iso8859_1 without control codes), and
the line should have been
try_encodings(text, ['ascii', 'utf-8', 'iso8859_1_ncc', 'cp1252', 'macroman'])
where iso8859_1_ncc.py is the same as iso8859_1.py from the Python
distribution, except for this one line:
decoding_map = codecs.make_identity_dict(range(32, 128)+range(128+32,256))
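For anyone who wants to try the idea without installing a codec module, here is a rough sketch in Python 3 syntax. It is not the iso8859_1_ncc.py file itself: decode_ncc, try_decoders and the sample bytes are hypothetical names and data for illustration, and the decoding goes through codecs.charmap_decode (the helper the stock charmap codec modules call) instead of registering a real codec.

```python
import codecs

# Python 3 spelling of the decoding_map line: range objects cannot be added,
# so convert them to lists first.
decoding_map = codecs.make_identity_dict(
    list(range(32, 128)) + list(range(128 + 32, 256)))


def decode_ncc(data):
    """Decode as Latin-1, but reject bytes 0-31 and 128-159 (hypothetical helper)."""
    text, _consumed = codecs.charmap_decode(data, 'strict', decoding_map)
    return text


def try_decoders(data, decoders):
    """Return (name, text) for the first decoder that accepts data, else None."""
    for name, decode in decoders:
        try:
            return name, decode(data)
        except UnicodeDecodeError:
            continue
    return None


decoders = [
    ('ascii', lambda b: b.decode('ascii')),
    ('utf-8', lambda b: b.decode('utf-8')),
    ('iso8859_1_ncc', decode_ncc),
    ('cp1252', lambda b: b.decode('cp1252')),
    ('macroman', lambda b: b.decode('mac_roman')),
]

# 0x93/0x94 fall in the excluded 128-159 range, so the search moves on to cp1252.
print(try_decoders(b'\x93quoted\x94', decoders))
# 0xe9 is inside the allowed range, so the restricted Latin-1 map accepts it.
print(try_decoders(b'caf\xe9', decoders))
```

One consequence of the map as written: range(32, 128) also leaves out tab, newline and carriage return, so multi-line text would be rejected by this decoder unless those bytes are added to the map.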
--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!