Spanish Accents

Peter Otten __peter__ at web.de
Thu Dec 22 13:17:30 EST 2011


Stan Iverson wrote:

> On Thu, Dec 22, 2011 at 12:42 PM, Peter Otten <__peter__ at web.de> wrote:
> 
>> The file is probably encoded in ISO-8859-1, ISO-8859-15, or cp1252 then:
>>
>> >>> print "\xe1".decode("iso-8859-1")
>> á
>> >>> print "\xe1".decode("iso-8859-15")
>> á
>> >>> print "\xe1".decode("cp1252")
>> á
>>
>> Try codecs.open() with one of these encodings.
>>
> 
> I'm baffled. I duplicated your print statements but when I run this code
> (or any of the 3 encodings):
> 
> file = codecs.open(p + "2.txt", "r", "cp1252")
> #file = codecs.open(p + "2.txt", "r", "utf-8")
> for line in file:
>   print line
> 
> I get this error:
> 
> *UnicodeEncodeError*: 'ascii' codec can't encode character u'\xe1' in
> position 48: ordinal not in range(128)

You are now one step further: you have successfully* decoded the file.
The remaining step is to encode the resulting unicode lines back into bytes.
The print statement implicitly uses sys.stdout.encoding, which is either
"ascii" or None in your case, so it cannot encode u'\xe1'. Try to encode
explicitly to UTF-8 with

import codecs

# p is the path prefix from your script
f = codecs.open(p + "2.txt", "r", "iso-8859-1")
for line in f:
    print line.encode("utf-8")
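
To see the two steps in isolation, here is a minimal sketch that works on an
in-memory byte string instead of your file. The sample bytes are made up for
illustration, and the snippet is written so that it should run unchanged on
Python 2 and Python 3:

```python
# Hypothetical sample: the word "olá" as an ISO-8859-1 file would store it.
raw = b"ol\xe1"

# Step 1: decode the bytes into a unicode string.
# codecs.open() does this for every line it reads.
text = raw.decode("iso-8859-1")

# Step 2a: encoding to ASCII fails; this is the error print ran into.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Step 2b: encoding to UTF-8 succeeds; U+00E1 becomes two bytes.
utf8 = text.encode("utf-8")

print(ascii_ok)
print(repr(utf8))
```

The decode step picks the encoding the file was written in; the encode step
picks whatever your terminal or output file expects, and the two need not
match.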

(*) This is no great achievement, however, as two of the above codecs
(ISO-8859-1, ISO-8859-15) would not even balk at a binary file, e.g. a JPEG.
They offer a character for every possible byte value.
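
You can check this yourself with a small sketch like the following. The
helper name undecodable() is my own invention, not part of any library; it
simply tries to decode each of the 256 byte values on its own and collects
the ones the codec rejects:

```python
def undecodable(encoding):
    """Return the byte values (0-255) that `encoding` refuses to decode."""
    bad = []
    for value in range(256):
        single = bytes(bytearray([value]))  # one-byte string on Python 2 and 3
        try:
            single.decode(encoding)
        except UnicodeDecodeError:
            bad.append(value)
    return bad

for name in ("iso-8859-1", "iso-8859-15", "cp1252"):
    rejected = undecodable(name)
    print("%s rejects %d byte value(s): %s"
          % (name, len(rejected), [hex(v) for v in rejected]))
```

The two ISO codecs should report nothing rejected, while cp1252 leaves a
handful of byte values (0x81, for instance) unassigned. So a successful
decode with an ISO-8859 codec says nothing about whether the guess was right.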



