UnicodeEncodeError when not running script from IDE

Tue Feb 12 22:51:57 EST 2013

On 02/12/2013 07:20 PM, Magnus Pettersson wrote:
>
>> You don't show the code that actually does the io.open(), nor the
>> url.encode, so I'm not going to guess what you're actually doing.
>
> Hmm, I'm not sure what you mean, but I wrote all the code needed in a previous post, so maybe you missed that one :)
> In short I basically just have:
> import io
> with io.open(myfile, "a", encoding="UTF-8") as f:
>     f.write(my_ustring_with_kanji)
>
> The url.encode() is my unicode string variable named "url" using the built-in method .encode(), which was the thing I wondered why I needed to use, which you explained very well, thank you!
>
> Just one more question since all this is still a little fuzzy in my head.
>
> When do I need to use .decode() in my code? Is it when I read lines from, for example, a UTF-8 file? And why didn't I have to use .encode() on my unicode string when running from within Eclipse PyDev? Someone wrote that it has a default codec setting, so maybe that handles it for me there (which is kind of dangerous, since my programs won't work running outside of Eclipse: I didn't do any encoding or use unicode strings before in my script, and it still worked).
>

.decode() goes from bytes to unicode, the exact reverse of .encode().
And you're right, you'd need it on input from a file, and theoretically
on input from a keyboard.
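
A minimal sketch of that round trip, in case it helps.  The variable
names and the two sample kanji are just my own illustration, and the
snippet is written so it should behave the same on Python 2.7 and 3.3+:

```python
# unicode text, written with escapes so the source file stays pure ASCII
text = u"\u6f22\u5b57"

# encode: unicode -> bytes
data = text.encode("utf-8")

# decode: bytes -> unicode, the exact reverse
back = data.decode("utf-8")

assert back == text
assert len(text) == 2    # two characters
assert len(data) == 6    # six bytes, three per character in UTF-8
```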

Conceptually, the easiest (not necessarily the fastest) thing to do is 
to always convert any input that comes in byte form to unicode, 
immediately on getting it. Then all processing in the code should be 
done in unicode form.  And you encode any output just before it goes out 
to a byte-device.
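
Here is a rough sketch of that pattern, assuming the byte-device is a
UTF-8 file.  The helper names load_lines and append_line, the file name,
and the example URLs are made up for illustration; io.open with an
explicit encoding does the decode on the way in and the encode on the
way out, so everything in between only ever sees unicode:

```python
import io
import os
import tempfile


def load_lines(path, encoding="utf-8"):
    # Input boundary: io.open decodes the file's bytes to unicode on read.
    with io.open(path, "r", encoding=encoding) as f:
        return [line.rstrip(u"\n") for line in f]


def append_line(path, text, encoding="utf-8"):
    # Output boundary: io.open encodes unicode to bytes on write.
    with io.open(path, "a", encoding=encoding) as f:
        f.write(text + u"\n")


# hypothetical scratch file, only for this demonstration
path = os.path.join(tempfile.mkdtemp(), "urls.txt")
append_line(path, u"http://example.com/\u6f22\u5b57")
append_line(path, u"http://example.com/plain")

# everything read back is unicode again
lines = load_lines(path)
```

Because the encoding is spelled out at both boundaries, this should not
depend on whatever defaults the IDE or the terminal happens to set.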

Python 3 makes that natural, as the string type is already unicode,
and it's byte strings that are the exception.  But all that really
changes is the syntax you use.

There are defaults all over the place on these conversions.  And 
apparently, your IDE sets those defaults for you, which is a nasty 
thing, since it means things that run in the IDE will run differently 
outside of it.  You're just lucky the difference was an error.  If there
hadn't been an error, you might have merrily been creating files with a
mixture of encodings, which is a real disaster.
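
If you want to see which defaults are in play, something like the
following can be run once inside the IDE and once from a plain shell.
It is only a sketch: report_defaults is a name I made up, and I'm
assuming these three settings are the ones most likely to differ.  The
last part shows the kind of failure an ASCII default produces:

```python
import locale
import sys


def report_defaults():
    # Collect the encodings Python falls back on when none is given.
    return {
        "sys.getdefaultencoding()": sys.getdefaultencoding(),
        # stdout may be replaced by an object without an encoding attribute
        "sys.stdout.encoding": getattr(sys.stdout, "encoding", None),
        "locale.getpreferredencoding()": locale.getpreferredencoding(),
    }


for name, value in sorted(report_defaults().items()):
    print("%s = %s" % (name, value))

# What happens when kanji meet an ASCII codec:
try:
    u"\u6f22\u5b57".encode("ascii")
    failed = False
except UnicodeEncodeError:
    failed = True
```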

One other place where decoding happens is in your source file.  There is 
an optional encoding line you can place at the top of the file 
(immediately after the shebang line) to change how unicode literals with 
non-ASCII characters are interpreted.  Remember your source file is a
byte file edited with some text editor, and it has been encoded,
deliberately or accidentally, by that editor.  You can avoid the issue by
always using escape sequences, but if for example, you copy/paste some 
unicode string from an email message into your source code, you'd like 
it to be equivalent.  If your email program, your text editor, and your 
Python compiler are all on the same page, it works amazingly simply.
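
As a small illustration, assuming the source file really is saved as
UTF-8 and the coding line says so, a pasted literal and the escaped form
come out equal.  The variable names are mine, and the last two lines
only simulate what a mismatch looks like by decoding the same bytes with
the wrong codec:

```python
# -*- coding: utf-8 -*-

# Pasted literal: depends on the editor and the coding line agreeing.
literal = u"漢字"

# Escape sequences: pure ASCII in the source, so no such dependency.
escaped = u"\u6f22\u5b57"

same = (literal == escaped)

# Simulated mismatch: UTF-8 bytes read as if they were Latin-1.
source_bytes = escaped.encode("utf-8")
misread = source_bytes.decode("latin-1")
```

Here misread ends up as six unrelated characters instead of two kanji,
which is the quiet kind of damage you get when the pieces disagree.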

(That encoding line may affect other things;  in Python 3, where
non-ASCII identifiers are allowed, it also determines how they are read
from the source, but I'm not sure if it matters in Python 2.x other than
for unicode literal strings)

-- 
DaveA