unicode issue

Wed Sep 30 22:38:17 EDT 2009

Piet van Oostrum wrote:
>>>>>> Dave Angel <davea at ieee.org> (DA) wrote:
>>>>>>             
> [snip]
>   
>> DA> Thanks for the correction. What I meant by "works for me" is that the
>> DA> single example in the docstring translated okay. But I do have a lot to
>> DA> learn about using Unicode in sources, and I want to learn.
>>     
>
>   
>> DA> So tell me, how were we supposed to guess what encoding the original
>> DA> message used? I originally had the mailing list message (in Thunderbird
>> DA> email). When I copied (copy/paste) to Komodo IDE (text editor), it wouldn't
>> DA> let me save because the file type was ASCII. So I randomly chose latin-1
>> DA> for file type, and it seemed to like it.
>>     
>
> You can see the encoding of the message in its headers. But it is not
> important, as the Unicode characters you see are what it is about. You
> just copy and paste them in your Python file. The Python file does not
> have to use the same encoding as the message from which you pasted. The
> editor will do the proper conversion. (If it doesn't throw it away
> immediately.) Only for the Python file you must choose an encoding that
> can encode all the characters that are in the file. In this case utf-8
> is the only reasonable choice, but if there are only latin-1 characters
> in the file then of course latin-1 (iso-8859-1) will also be good.
>
> Any decent editor will only allow you to save in an encoding that can
> encode all the characters in the file, otherwise you will lose some
> characters. 
>
> Because Python must also know which encoding you used and this is not in
> itself deducible from the file contents, you need the coding
> declaration. And it must be the same as the encoding in which the file
> is saved, otherwise Python will see something different than you saw in
> your editor. Sooner or later this will give you a big headache.
>
>   
>> DA> At that point I expected and got errors from Python because I had no coding
>> DA> declaration. I used latin-1, and still had problems, though I forget what
>> DA> they were. Only when I changed the file encoding type again, to utf-8, did
>> DA> the errors go away. I agree that they should agree, but I don't know how to
>> DA> reconcile the copy/paste boundary, the file type (without BOM, which is
>> DA> another variable), the coding declaration, and the stdout implicit ASCII
>> DA> encoding. I understand a bunch of it, but not enough to be able to safely
>> DA> walk through the choices.
>>     
>
>   
>> DA> Is this all written up in one place, to where an experienced programmer can
>> DA> make sense of it? I've nibbled at the edges (even wrote a UTF-8 
>> DA> encoder/decoder a dozen years ago).
>>     
>
> I don't know a place. Usually utf-8 is a safe bet but in some cases can
> be overkill. And then in your Python input/output (read/write) you may
> have to use a different encoding if the programs that you have to
> communicate with expect something different.
>   

I know what I was missing.  The copy/paste must be doing it in pure 
Unicode.  And the in-memory version of the source text is in Unicode.  
So the text editor's encoding affects how that Unicode is encoded into 
8-bit bytes for the file (and how it will be reloaded next time).  OK, 
that seems to make sense.
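
To convince myself, here is a minimal sketch of that idea. It assumes 
Python 3 syntax (where str is already Unicode), and the sample string is 
made up for illustration, not taken from the original message.  It only 
shows that one in-memory Unicode string becomes different bytes depending 
on the encoding chosen at save time, and that reloading with a different 
encoding gives different text.

```python
# Hypothetical sample text, standing in for whatever was pasted into the editor.
text = "na\u00efve caf\u00e9"  # two non-ASCII characters, both within latin-1

# "Saving" the same in-memory text with two different file encodings.
latin1_bytes = text.encode("latin-1")  # one byte per character for this text
utf8_bytes = text.encode("utf-8")      # each non-ASCII character takes two bytes

print(len(text), len(latin1_bytes), len(utf8_bytes))

# "Reloading" with the encoding used for saving restores the original text.
assert latin1_bytes.decode("latin-1") == text
assert utf8_bytes.decode("utf-8") == text

# Reloading with a different encoding yields different text, which is
# what happens when the coding declaration and the file encoding disagree.
mismatched = utf8_bytes.decode("latin-1")
print(mismatched == text)

# An encoding that cannot represent every character refuses outright,
# much like the editor refusing to save the file as ASCII.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print("cannot save as ASCII:", exc.reason)
```

So the byte counts differ between the two encodings even though the text 
is the same, and the mismatched reload silently produces the wrong 
characters rather than an error.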

I know that the clipboard has type tags, but I haven't looked at them in 
so long that I forget what they look like.  For text, is it just ASCII 
and Unicode?  Or are there other possible encodings that the source and 
sink negotiate?

Thanks for the clear explanation.

DaveA