Python and encodings drives me crazy

John Machin sjmachin at lexicon.net
Mon Jun 20 19:38:02 EDT 2005


Oliver Andrich wrote:
> 2005/6/21, Konstantin Veretennicov <kveretennicov at gmail.com>:
> 
>>It does, as long as headline and caption *can* actually be encoded as
>>macroman. After you decode headline from utf-8 it will be unicode and
>>not all unicode characters can be mapped to macroman:
>>
>>
>> >>> u'\u0160'.encode('utf8')
>> '\xc5\xa0'
>> >>> u'\u0160'.encode('latin2')
>> '\xa9'
>> >>> u'\u0160'.encode('macroman')
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in ?
>>   File "D:\python\2.4\lib\encodings\mac_roman.py", line 18, in encode
>>     return codecs.charmap_encode(input,errors,encoding_map)
>> UnicodeEncodeError: 'charmap' codec can't encode character u'\u0160' in position 0: character maps to <undefined>
> 
> 
> Yes, this and the coercion problems Diez mentioned were the problems I
> faced. Now I have written a little cleanup method that removes the
> bad characters from the input

By "bad characters", do you mean characters that are in Unicode but not 
in MacRoman?
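
If that is the definition, a quick way to see which characters are affected is to try encoding them one at a time. The following is only a sketch, written in Python 3 syntax (where encode() returns bytes); the helper name unencodable is made up for illustration:

```python
def unencodable(text, encoding='mac_roman'):
    """Return the distinct characters of text that encoding cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            # Record each offending character once, in order of appearance.
            if ch not in bad:
                bad.append(ch)
    return bad

# Only the S-with-caron (U+0160) has no MacRoman equivalent here.
missing = unencodable('The Good Soldier \u0160vejk')
```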

By "removes the bad characters", do you mean "deletes", or do you mean 
"substitutes one or more MacRoman characters"?

If all you want to do is torch the bad guys, you don't have to write "a 
little cleanup method".

To leave a tombstone for the bad guys:

 >>> u'abc\u0160def'.encode('macroman', 'replace')
'abc?def'
 >>>
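
If a question mark loses too much, the standard error handlers 'xmlcharrefreplace' and 'backslashreplace' leave a tombstone that still names the deceased. A small sketch in Python 3 syntax (results are bytes rather than str; the variable names are only illustrative):

```python
text = 'abc\u0160def'

# Replace each unencodable character with an XML numeric character reference.
as_xml = text.encode('mac_roman', 'xmlcharrefreplace')

# Replace each unencodable character with a Python-style backslash escape.
as_escape = text.encode('mac_roman', 'backslashreplace')
```

Whether the receiving application does anything sensible with such markers is a separate question.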

To leave no memorial, only a cognitive gap:

 >>> u'The Good Soldier \u0160vejk'.encode('macroman', 'ignore')
'The Good Soldier vejk'
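
If "substitutes" is what you meant, one possible approach (an assumption on my part, not necessarily what your cleanup method does) is to fall back to the unaccented base letter only for characters MacRoman lacks. A sketch in Python 3 syntax, with to_macroman as a hypothetical helper name:

```python
import unicodedata

def to_macroman(text):
    """Encode text as MacRoman, substituting a base letter where possible."""
    out = []
    for ch in text:
        try:
            # Characters MacRoman knows are encoded unchanged.
            out.append(ch.encode('mac_roman'))
        except UnicodeEncodeError:
            # Decompose (U+0160 becomes 'S' plus a combining caron) and keep
            # whatever part MacRoman can represent; otherwise emit '?'.
            decomposed = unicodedata.normalize('NFKD', ch)
            out.append(decomposed.encode('mac_roman', 'ignore') or b'?')
    return b''.join(out)

svejk = to_macroman('The Good Soldier \u0160vejk')
```

Encoding character by character keeps accented letters that MacRoman does support, which a blanket normalize-then-ignore over the whole string would strip.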

Do you *really* need to encode it as MacRoman? Can't the Mac app 
understand utf8?

You mentioned cp850 in an earlier post. What are you feeding 
cp850-encoded data to that doesn't understand cp1252 and isn't in a museum?

Cheers,
John


