What encoding does u'...' syntax use?

Ron Garret rNOSPAMon at flownet.com
Fri Feb 20 16:15:51 EST 2009


In article <499f18bd$0$31879$9b4e6d93 at newsspool3.arcor-online.net>,
 Stefan Behnel <stefan_ml at behnel.de> wrote:

> Ron Garret wrote:
> > I would have thought that the answer would be: the default encoding 
> > (duh!)  But empirically this appears not to be the case:
> > 
> >>>> unicode('\xb5')
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in <module>
> > UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 0: 
> > ordinal not in range(128)
> >>>> u'\xb5'
> > u'\xb5'
> >>>> print u'\xb5'
> > µ
> > 
> > (That last character shows up as a micro sign despite the fact that my 
> > default encoding is ascii, so it seems to me that that unicode string 
> > must somehow have picked up a latin-1 encoding.)
> 
> You are mixing up console output and internal data representation. What you
> see in the last line is what the Python interpreter makes of your unicode
> string when passing it into stdout, which in your case seems to use a
> latin-1 encoding (check your environment settings for that).
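
Something like the following would confirm that (Python 2 session; the 
value of sys.stdout.encoding depends on the terminal and locale settings, 
so the output below is illustrative rather than a verbatim transcript):

>>> import sys
>>> sys.stdout.encoding               # the codec print uses for unicode output
'ISO8859-1'
>>> print u'\xb5'.encode('latin-1')   # encoding by hand: same bytes as print
µ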
> 
> BTW, Unicode is not an encoding. Wikipedia will tell you more.

Yes, I know that.  But every concrete representation of a unicode string 
has to have an encoding associated with it, including the unicode strings 
produced by the Python parser when it parses the ascii string "u'\xb5'".

My question is: what is that encoding?  It can't be ascii.  So what is 
it?
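
To make that concrete, here is what the parser actually hands back (a 
plain Python 2 session; the len and ord values are deterministic, though 
the internal byte layout would differ between narrow and wide unicode 
builds):

>>> s = u'\xb5'     # the literal in question
>>> len(s)          # one character...
1
>>> ord(s)          # ...whose code point is 0xb5, i.e. 181
181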

To put this another way: I would have thought that when the Python parser 
parses "u'\xb5'" it would produce the same result as calling 
unicode('\xb5'), but it doesn't.  Instead, it seems to produce the same 
result as calling unicode('\xb5', 'latin-1').  But my default encoding 
is not latin-1, it's ascii.  So where is the Python parser getting its 
encoding from?  And why does parsing "u'\xb5'" not produce the same error 
as calling unicode('\xb5')?
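
For reference, the three calls side by side (Python 2; the traceback is 
abbreviated):

>>> unicode('\xb5')                # decoded with the default codec: ascii
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 0:
ordinal not in range(128)
>>> unicode('\xb5', 'latin-1')     # explicit latin-1
u'\xb5'
>>> u'\xb5'                        # the literal: same result as the latin-1 call
u'\xb5'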

rg
