base64 and unicode

Fri May 4 03:36:31 EDT 2007

EuGeNe Van den Bulke <eugene.vandenbulke at gmail.com> wrote:

> >>> import base64
> >>> base64.decode(file("hebrew.b64","r"),file("hebrew.lang","w"))
> 
> It runs but the result is not correct: some of the lines in hebrew.lang 
> are correct but not all of them (hebrew.expected.lang is the correct 
> file). I guess it is a unicode problem but can't seem to find out how to 
> fix it.

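My guess would be that your problem is that you wrote the file in text 
mode, so (assuming you are on Windows) all newline characters in the output 
are converted to carriage return/linefeed pairs. However, the decoded text 
looks as though it is UTF-16 encoded, so it should be written as binary, 
i.e. the output mode should be "wb". UTF-16 is made of two-byte units, so 
each extra carriage-return byte shifts the data that follows out of 
alignment, which would explain why only some of the lines come out wrong.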
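
As a rough sketch of the binary-mode fix, here it is written for current 
Python 3, where file() is spelled open(). The helper name decode_b64_file, 
the sample text and the temporary file names are made up for illustration; 
they stand in for hebrew.b64 and hebrew.lang. The last line only simulates 
what a Windows text-mode write does to the bytes:

```python
import base64
import os
import tempfile


def decode_b64_file(src_path, dst_path):
    """Decode a base64 file to raw bytes, with both files in binary mode."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        base64.decode(src, dst)


# Made-up sample data: two short Hebrew lines, encoded as UTF-16.
sample = "\u05e9\u05dc\u05d5\u05dd\n\u05e2\u05d5\u05dc\u05dd\n".encode("utf-16")

workdir = tempfile.mkdtemp()
b64_path = os.path.join(workdir, "sample.b64")
out_path = os.path.join(workdir, "sample.lang")

# Build the base64 input file that the helper will decode.
with open(b64_path, "wb") as f:
    f.write(base64.encodebytes(sample))

decode_b64_file(b64_path, out_path)

with open(out_path, "rb") as f:
    roundtrip = f.read()

# Simulation of a Windows text-mode write: every 0x0A byte gains a 0x0D.
corrupted = sample.replace(b"\n", b"\r\n")
```

With binary output the decoded bytes match the original exactly, while the 
simulated text-mode output is two bytes longer and no longer lines up on 
two-byte boundaries after the first newline.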

Rather than using the base64 module, you can simply use the base64 codec. 
This decodes a string to a byte sequence, which you can then decode again 
to get the unicode string:

from __future__ import with_statement  # required for 'with' on Python 2.5

with file("hebrew.b64", "r") as f:
    text = f.read().decode('base64').decode('utf16')

You can then write the text to a file through any desired codec or process 
it first.
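
For instance, re-saving the text as UTF-8 might look something like the 
sketch below. It is written for current Python 3, where the 'base64' 
string codec no longer exists, so base64.b64decode() takes its place. The 
helper name b64_to_text, the sample string and the output file name are 
assumptions made for the example:

```python
import base64
import os
import tempfile


def b64_to_text(b64_data, encoding="utf-16"):
    """base64 data -> raw bytes -> unicode string."""
    return base64.b64decode(b64_data).decode(encoding)


# Made-up sample: one Hebrew word, UTF-16 encoded, then base64 encoded.
original = "\u05e9\u05dc\u05d5\u05dd\n"
b64_data = base64.b64encode(original.encode("utf-16"))

text = b64_to_text(b64_data)

# Write the text back out through a different codec (UTF-8 here).
# newline="" switches off newline translation on every platform.
out_path = os.path.join(tempfile.mkdtemp(), "sample.utf8.lang")
with open(out_path, "w", encoding="utf-8", newline="") as out:
    out.write(text)

with open(out_path, "rb") as f:
    written = f.read()
```

Choosing the output codec explicitly, instead of relying on the platform 
default, keeps the result the same from one machine to the next.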

BTW, you may just have shortened your example too much, but depending on 
Python to close files for you is risky behaviour. If an exception is 
thrown before the file goes out of scope, it may not get closed when you 
expect, and that can lead to some fairly hard-to-track problems. It is much 
better either to call the close method explicitly or to use Python 2.5's 
'with' statement.
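
To illustrate the two options, here is a minimal sketch in current 
Python 3 syntax (open() rather than file()). The function names, the 
temporary file and the simulated ValueError are all made up for the 
example:

```python
import os
import tempfile


def read_with_close(path):
    """Explicit close: the finally clause runs even if read() raises."""
    f = open(path, "rb")
    try:
        return f.read()
    finally:
        f.close()


def read_with_with(path):
    """The 'with' statement closes the file when the block exits."""
    with open(path, "rb") as f:
        return f.read()


# Made-up sample file to read back.
path = os.path.join(tempfile.mkdtemp(), "sample.bin")
with open(path, "wb") as f:
    f.write(b"abc")

data_a = read_with_close(path)
data_b = read_with_with(path)

# Even when the block raises, the file is closed on the way out.
handle = None
try:
    with open(path, "rb") as f:
        handle = f
        raise ValueError("simulated failure")
except ValueError:
    pass
```

Both functions return the same bytes, and handle.closed is True after the 
simulated failure, so the file does not stay open until the garbage 
collector gets to it.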