encode/decode misunderstanding

Sun Jul 29 09:20:32 EDT 2007

Tim Arnold wrote:
> Hi, I'm beginning to understand the encode/decode string methods, but I'd 
> like confirmation that I'm still thinking in the right direction:
> 
> I have a file of latin1 encoded text. Let's say I put one line of that file 
> into a string variable 'tocline', as follows:
> tocline = 'Ficha Datos de p\xe9rdida AND acci\xf3n'
> 
> import codecs
> tocFile = codecs.open('mytoc.htm','wb',encoding='utf8',errors='replace')
> tocline = tocline.decode('latin1','replace')
> tocFile.write(tocline)
> tocFile.close()
> 
> What I think is that tocFile is wrapped to ensure that anything written to 
> it is in utf8.
> I decode the latin1 string into python's internal unicode encoding and that 
> gets written out as utf8.
> 
> Questions:
> what exactly is the tocline when it's read in with that \xe9 and \xf3 in the 
> string? A latin1 encoded string?

Yes. It is a plain byte string; it just happens to contain bytes that 
spell the intended text when they are interpreted as latin1.
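
A minimal sketch of that distinction, using the string from the question. 
The variable names are only illustrative, and the b prefix is an addition 
so that the literal is a byte string on Python 3 as well as on Python 2.6+:

```python
# Byte string: nothing in the object itself records that it is latin1.
raw = b'Ficha Datos de p\xe9rdida AND acci\xf3n'

# Decoding interprets the bytes as latin1 and yields a unicode object.
text = raw.decode('latin1')

# Encoding turns the unicode object back into bytes, this time as UTF-8.
utf8_bytes = text.encode('utf8')

# Each accented character is one byte in latin1 but two bytes in UTF-8.
print(len(raw), len(text), len(utf8_bytes))
```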

> Is my method the right way to write such a line out to a file with utf8 
> encoding?

Yes.
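
One way to convince yourself is to write the line and then look at the raw 
bytes on disk. This is only a sketch: the temporary directory is an 
assumption made so that the example does not touch a real mytoc.htm.

```python
import codecs
import os
import tempfile

tocline = b'Ficha Datos de p\xe9rdida AND acci\xf3n'

# Hypothetical scratch location instead of the real output file.
path = os.path.join(tempfile.mkdtemp(), 'mytoc.htm')

# Same calls as in the question: decode to unicode, let the wrapper encode.
toc_file = codecs.open(path, 'wb', encoding='utf8', errors='replace')
toc_file.write(tocline.decode('latin1', 'replace'))
toc_file.close()

# Read the file back without any decoding to inspect what was written.
with open(path, 'rb') as f:
    written = f.read()
print(repr(written))
```

The two latin1 bytes \xe9 and \xf3 should show up as the two-byte UTF-8 
sequences \xc3\xa9 and \xc3\xb3.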

> If I read in the latin1 file using
> codecs.open(filename,encoding='latin1') and write out the utf8 file by 
> opening with
> codecs.open(othername,encoding='utf8'), would I no longer have a problem --  
> I could just read in latin1 and write out utf8 with no more worries about 
> encoding?

As long as you don't mix in byte strings and work only with unicode 
objects, you should be fine, yes.
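
A sketch of that approach, assuming small line-oriented files. The file 
names and the sample content are made up for illustration:

```python
import codecs
import os
import tempfile

# Hypothetical input and output files in a scratch directory.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'latin1.txt')
dst = os.path.join(workdir, 'utf8.txt')

# Create a small latin1-encoded sample file to convert.
with open(src, 'wb') as f:
    f.write(b'p\xe9rdida\nacci\xf3n\n')

# The reader decodes latin1 into unicode objects; the writer encodes
# unicode objects as UTF-8. No byte strings are handled in between.
infile = codecs.open(src, encoding='latin1')
outfile = codecs.open(dst, 'w', encoding='utf8')
for line in infile:
    outfile.write(line)
infile.close()
outfile.close()

# Inspect the raw bytes of the converted file.
with open(dst, 'rb') as f:
    result = f.read()
print(repr(result))
```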

Diez