[Tutor] input file encoding

Tue Sep 11 12:50:11 CEST 2007

Tim Golden wrote:
> Tim Michelsen wrote:
>> Hello,
>> I want to process some files encoded in latin-1 (iso-8859-1) in my 
>> python script that I write on Ubuntu which has UTF-8 as standard encoding.
> 
> Not sure what you mean by "standard encoding" (is this an Ubuntu
> thing?) 

Probably referring to the encoding the terminal application expects - 
writing latin-1 chars when the terminal expects utf-8 will not work well, 
because bytes above 0x7f in latin-1 are generally not valid utf-8.
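
To make the mismatch concrete, here is a small sketch (not from the original thread) comparing the bytes the same character produces in each encoding. The variable names are made up for illustration, and it is written to run on both Python 2.6+ and Python 3.3+.

```python
# Sketch only: one character, two different byte encodings.
ch = u'\xe2'                             # a with circumflex

latin1_bytes = ch.encode('iso-8859-1')   # a single byte, 0xE2
utf8_bytes = ch.encode('utf-8')          # two bytes, 0xC3 0xA2

# A utf-8 terminal that receives the lone latin-1 byte sees an
# incomplete multi-byte sequence, so it cannot decode it.
try:
    latin1_bytes.decode('utf-8')
    valid_as_utf8 = True
except UnicodeDecodeError:
    valid_as_utf8 = False

print(repr(latin1_bytes))
print(repr(utf8_bytes))
print(valid_as_utf8)
```

The exact repr shown for the two byte strings differs between Python versions, but the byte values are the same.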

Python also has a default encoding (see sys.getdefaultencoding()), but 
that is ascii unless you change it yourself.

> In this case, assuming you have files in iso-8859-1, something
> like this:
> 
> <code>
> import codecs
> 
> filenames = ['a.txt', 'b.txt', 'c.txt']
> for filename in filenames:
>    f = codecs.open (filename, encoding="iso-8859-1")
>    text = f.read ()
>    #
>    # If you want to re-encode this -- not sure why --

This is needed to put the text into the proper encoding for the 
terminal. If you print a unicode string directly it will be encoded 
using the system default encoding (ascii), which fails for any 
non-ASCII character:

In [13]: print u'\xe2'
------------------------------------------------------------
Traceback (most recent call last):
   File "<ipython console>", line 1, in <module>
<type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\xe2' in position 0: ordinal not in range(128)

In [14]: print u'\xe2'.encode('utf-8')
â

>    # you could do this:
>    # text = text.encode ("utf-8")
>    print repr (text)

No, not repr; that will print the string with backslash escapes and quotes.

In [15]: print repr(u'\xe2'.encode('utf-8'))
'\xc3\xa2'

And he may not want to change text itself to utf-8. Just encode at the 
point of printing:

   print text.encode('utf-8')
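
Putting the pieces together, a minimal sketch of the whole round trip might look like the following. The helper read_latin1 and the temporary sample file are hypothetical, added only so the example is self-contained; the codecs.open call is the one from Tim's code. It is written to run on both Python 2 and Python 3.

```python
import codecs
import os
import tempfile

def read_latin1(filename):
    """Return the file contents as a unicode string, decoded from iso-8859-1."""
    f = codecs.open(filename, encoding="iso-8859-1")
    try:
        return f.read()
    finally:
        f.close()

# Create a small latin-1 encoded file to read (hypothetical sample data).
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
out = open(path, "wb")
out.write(u"caf\xe9\n".encode("iso-8859-1"))
out.close()

text = read_latin1(path)         # unicode string, kept as-is for processing
encoded = text.encode("utf-8")   # separate byte string for a utf-8 terminal
os.remove(path)
```

Here text stays unicode for any further processing, and only the separate value encoded holds utf-8 bytes; on Python 2, printing encoded sends those bytes to the terminal.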

Kent