encoding problems (é and è)

Peter Otten __peter__ at web.de
Fri Mar 24 06:16:51 EST 2006


Duncan Booth wrote:

> There's a nice little codec from Skip Montanaro for removing accents from
> latin-1 encoded strings. It also has an error handler so you can convert
> from unicode to ascii and strip all the accents as you do so:
> 
> http://orca.mojam.com/~skip/python/latscii.py
> 
> >>> import latscii
> >>> import htmlentitydefs
> >>> print u'\u00c9'.encode('ascii','replacelatscii')
> E
> >>>
> 
> So Bussiere could replace a large chunk of his code with:
> 
>     ligneA = ligneA.decode(INPUTENCODING).encode('ascii', 'replacelatscii')
>     ligneA = ligneA.upper()
> 
> INPUTENCODING is 'utf8' unless (one possible explanation for his problem)
> his files are actually in some different encoding.
> 
> Unfortunately, just as I finished writing this I discovered that the
> latscii module isn't as robust as I thought: it blows up on consecutive
> accented characters.
> 
>  :(

You made me look into it -- and I found that reusing the decoding map as the
encoding map lets you write

>>> u"Élève ééé".encode("latscii")
'Eleve eee'

without relying on the faulty error handler. I tried to fix the handler,
too:

>>> u"Élève ééé".encode("ascii", "replacelatscii")
'Eleve eee'
>>> g = u"\N{GREEK CAPITAL LETTER GAMMA}"
>>> (u"möglich ähnlich üblich ááá" + g*3).encode("ascii", "replacelatscii")
'moglich ahnlich ublich aaa???'

No real testing was performed.
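
For readers on Python 3, here is a minimal sketch of why swapping the
encoding map matters. The three-entry table is a hypothetical stand-in for
the real latscii table, and latscii_encode is an illustrative helper, not
part of the module. With the identity map, u"é" would simply be encoded as
the Latin-1 byte 0xE9; with the decoding map reused, it becomes "e".

```python
import codecs

# Identity for the Latin-1 range, then a few accent overrides.
# ASSUMPTION: abbreviated, hypothetical subset of latscii's real table.
decoding_map = codecs.make_identity_dict(range(256))
decoding_map.update({
    0x00C9: ord("E"),  # É
    0x00E8: ord("e"),  # è
    0x00E9: ord("e"),  # é
})

# The change from the patch: encode through the same table.
encoding_map = decoding_map

def latscii_encode(text, errors="strict"):
    # charmap_encode looks up each code point in the mapping and
    # returns a (bytes, number_of_characters_consumed) pair.
    return codecs.charmap_encode(text, errors, encoding_map)

encoded, consumed = latscii_encode("Élève ééé")
print(encoded)
```

Characters above U+00FF are not in the table, so with errors="strict" they
would still raise UnicodeEncodeError in this sketch.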

Peter

--- latscii_old.py      2006-03-24 11:45:22.580588520 +0100
+++ latscii.py  2006-03-24 11:48:13.191651696 +0100
@@ -141,7 +141,7 @@

 ### Encoding Map

-encoding_map = codecs.make_identity_dict(range(256))
+encoding_map = decoding_map


 ### From Martin Blais
@@ -166,9 +166,9 @@
 ##   ustr.encode('ascii', 'replacelatscii')
 ##
 def latscii_error( uerr ):
-    key = ord(uerr.object[uerr.start:uerr.end])
+    key = ord(uerr.object[uerr.start])
     try:
-        return unichr(decoding_map[key]), uerr.end
+        return unichr(decoding_map[key]), uerr.start + 1
     except KeyError:
         handler = codecs.lookup_error('replace')
         return handler(uerr)
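
A rough Python 3 rendering of the patched handler, for anyone who wants to
try the idea today. It is only a sketch: the table is a hypothetical,
abbreviated stand-in for latscii's decoding_map, and unichr becomes chr.
As far as I can tell, the ascii encoder may hand the handler a whole run of
unencodable characters at once, which is why the old code's ord() on the
start:end slice failed for consecutive accents. One deliberate difference
from the patch: for unmapped characters this sketch emits a single "?" and
resumes one character later, instead of delegating to the built-in
'replace' handler, which would replace the entire reported run.

```python
import codecs

# ASSUMPTION: hypothetical, abbreviated stand-in for latscii's decoding_map
# (code point of an accented character -> code point of its ASCII look-alike).
decoding_map = {
    0x00C9: ord("E"),  # É
    0x00E1: ord("a"),  # á
    0x00E4: ord("a"),  # ä
    0x00E8: ord("e"),  # è
    0x00E9: ord("e"),  # é
    0x00F6: ord("o"),  # ö
    0x00FC: ord("u"),  # ü
}

def latscii_error(uerr):
    # Look only at the first offending character, even if the codec
    # reported a longer run via uerr.start..uerr.end.
    key = ord(uerr.object[uerr.start])
    # Unmapped characters fall back to "?"; either way, resume encoding
    # right after the one character that was handled.
    replacement = chr(decoding_map.get(key, ord("?")))
    return replacement, uerr.start + 1

codecs.register_error("replacelatscii", latscii_error)

g = "\N{GREEK CAPITAL LETTER GAMMA}"
print("Élève ééé".encode("ascii", "replacelatscii"))
print(("möglich ähnlich üblich ááá" + g * 3).encode("ascii", "replacelatscii"))
```

Handling one character per call keeps a mapped accent that directly follows
an unmapped character (for example GAMMA followed by é) from being swallowed
into the "?" replacement.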




