Unicode question

Thu Jul 17 21:59:39 EDT 2003

On Fri, 18 Jul 2003 02:07:13 +0200, Gerhard Häring wrote:

> Thomas Heller wrote:
>> Gerhard Häring <gh at ghaering.de> writes:
>> 
>> 
>>> >>> u"äöü"
>>>u'\x84\x94\x81'
>>>
>>>(Python 2.2.3/2.3b2; sys.getdefaultencoding() == "ascii")
>>>
>>>Why does this work?
>>>
>>>Does Python guess which encoding I mean? I thought Python should
>>>refuse to guess :-)
>> 
>> 
>> I stumbled over this yesterday, and it seems it is (at least) partially
>> answered by PEP 263:
>> 
>>     In Python 2.1, Unicode literals can only be written using the
>>     Latin-1 based encoding "unicode-escape". This makes the programming
>>     environment rather unfriendly to Python users who live and work in
>>     non-Latin-1 locales such as many of the Asian countries. Programmers
>>     can write their 8-bit strings using the favorite encoding, but are
>>     bound to the "unicode-escape" encoding for Unicode literals.
>> 
>> I have the impression that this is undocumented on purpose, because you
>> should not write unescaped non-ansi characters into the source file
>> (with 'unknown' encoding).
> 
> I agree that using latin1 as default is bad. If there's an encoding 
> cookie in the 2.3+ source file then this encoding could be used.
> -- Gerhard

You can write string literals in any encoding and decode them explicitly:
'string in my favorite encoding'.decode('my favorite encoding')
Note the lack of the u prefix. Not very comfortable, though.
u'string' ends up doing the same as 'string'.decode('latin1').
The .decode() approach doesn't work for docstrings, though.
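
A minimal sketch of the difference, written for Python 3 (where byte
strings need a b prefix; in Python 2 a plain 'str' literal plays that
role). The bytes are the ones from Gerhard's output. That his console
was using cp437 or cp850 is my assumption; those bytes do spell "äöü"
in both code pages.

```python
# Python 3 sketch. The byte values come from the u'\x84\x94\x81' output above.
raw = b'\x84\x94\x81'

# What the u'...' literal effectively did: treat each byte as the
# Latin-1 code point with the same number.
as_latin1 = raw.decode('latin1')

# Explicit decode with the encoding the bytes were (assumed to be) typed in.
as_cp437 = raw.decode('cp437')

# ascii() keeps the output printable on any terminal.
print(ascii(as_latin1))
print(ascii(as_cp437))
```

The Latin-1 result is three C1 control characters, not the letters that
were typed, which is why the implicit default only looks like it works.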

I'm not sure what you mean by an encoding cookie, but I like the idea
of each source file having some element that defines the encoding used to
process string literals.
Either that, or we define that Python code must be written in UTF-8. But
that would break lots of code.. :D
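
For what it's worth, PEP 263 (quoted above) seems to describe exactly
such an element: a comment on the first or second line of the file that
names its encoding. A rough Python 3 sketch of the effect, compiling
source bytes directly instead of reading a file (the "<demo>" filename
and the variable names are just placeholders of mine):

```python
# Python 3 sketch: source bytes carrying a PEP 263 coding declaration.
# The literal inside the source holds the Latin-1 bytes for "äöü".
source = b"# -*- coding: latin-1 -*-\ns = '\xe4\xf6\xfc'\n"

namespace = {}
# compile() accepts bytes and honours the coding comment when decoding them.
exec(compile(source, "<demo>", "exec"), namespace)

decoded = namespace["s"]
print(ascii(decoded))
```

Without the first line, Python 3 would assume UTF-8 and reject these
bytes, so the declaration is what makes the literal unambiguous.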

-- 
	Ricardo