What encoding does u'...' syntax use?

"Martin v. Löwis" martin at v.loewis.de
Fri Feb 20 18:15:08 EST 2009


> Yes, I know that.  But every concrete representation of a unicode string 
> has to have an encoding associated with it, including unicode strings 
> produced by the Python parser when it parses the ascii string "u'\xb5'".
> 
> My question is: what is that encoding?

The internal representation is either UTF-16 or UTF-32; which one is
used is a compile-time choice (i.e. it is fixed when the Python
interpreter is built).
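
You can check which kind of build you have by looking at
sys.maxunicode; roughly, a narrow (UTF-16) build reports 0xFFFF and a
wide (UTF-32) build reports 0x10FFFF. For example:

  >>> import sys
  >>> sys.maxunicode   # 65535 on a narrow build, 1114111 on a wide one
  65535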

> Put this another way: I would have thought that when the Python parser 
> parses "u'\xb5'" it would produce the same result as calling 
> unicode('\xb5'), but it doesn't.

Right. In the former case, \xb5 denotes a Unicode character, namely
U+00B5, MICRO SIGN. It is the same as u"\u00b5", and still the same
as u"\N{MICRO SIGN}". By "the same", I mean "the very same".

OTOH, unicode('\xb5') is something entirely different. '\xb5' is a
byte string with length 1, with a single byte with the numeric
value 0xb5, or 181. It does not, per se, denote any specific character.
It only gets a character meaning when you try to decode it to unicode,
which you do with unicode('\xb5'). This is short for

  unicode('\xb5', sys.getdefaultencoding())

and sys.getdefaultencoding() is (or should be) "ascii". Now, in
ASCII, byte 0xb5 does not have a meaning (i.e. it does not denote
a character at all), hence you get a UnicodeError.
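
You can see both halves of that at the prompt (assuming the default
encoding has not been changed from "ascii"); you should get something
like:

  >>> import sys
  >>> sys.getdefaultencoding()
  'ascii'
  >>> unicode('\xb5')
  Traceback (most recent call last):
    ...
  UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in
  position 0: ordinal not in range(128)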

> Instead it seems to produce the same 
> result as calling unicode('\xb5', 'latin-1').

Sure. However, this is only by coincidence: Latin-1 maps each byte
value 0..255 to the Unicode code point with the same numeric value,
so decoding the byte 0xb5 as Latin-1 happens to give U+00B5.
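
For example, decoding the byte explicitly as latin-1 gives back
exactly the same unicode object as the literal:

  >>> unicode('\xb5', 'latin-1')
  u'\xb5'
  >>> unicode('\xb5', 'latin-1') == u'\xb5'
  True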

> But my default encoding 
> is not latin-1, it's ascii.  So where is the Python parser getting its 
> encoding from?  Why does parsing "u'\xb5'" not produce the same error as 
> calling unicode('\xb5')?

Because, inside a u'...' literal, \xb5 *directly* refers to the
character U+00B5, with no byte-oriented encoding in between.
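
In other words, the parser turns the escape into the code point
itself, without consulting any codec:

  >>> u'\xb5' == unichr(0xb5)   # the literal is just code point 0xb5
  True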

Regards,
Martin
