Unicode literals and byte string interpretation.

Fri Oct 28 03:06:08 EDT 2011

On Thu, 27 Oct 2011 20:05:13 -0700, Fletcher Johnson wrote:

> If I create a new Unicode object u'\x82\xb1\x82\xea\x82\xcd' how does
> this creation process interpret the bytes in the byte string? 

It doesn't, because there is no byte string. You have created a Unicode 
object from a literal string of Unicode characters, not bytes. In a u'...' 
literal, each \xNN escape names the code point U+00NN, not a byte. Those 
characters are:

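Dec Hex  Char
130 0x82 (C1 control character, no printable glyph)
177 0xb1 ±
130 0x82 (C1 control character, no printable glyph)
234 0xea ê
130 0x82 (C1 control character, no printable glyph)
205 0xcd Í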

Don't be fooled by the fact that all of the characters happen to be in 
the range 0-255; that is irrelevant.
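
As a rough check of the point above, here is a minimal sketch. It assumes 
Python 3, where every str literal is Unicode and the u prefix is optional; 
the variable names are mine, not from the original question.

```python
import unicodedata

# Python 3: a plain str literal is already Unicode text.
s = '\x82\xb1\x82\xea\x82\xcd'

# Each \xNN escape names the code point U+00NN; no decoding takes place.
code_points = [ord(c) for c in s]
categories = [unicodedata.category(c) for c in s]

for c, cp, cat in zip(s, code_points, categories):
    # Control characters have no name, so supply a default.
    print('U+%04X' % cp, cat, unicodedata.name(c, '<unnamed>'))
```

The string has six characters, and nothing in it records which encoding, 
if any, the numbers were originally meant for.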

> Does it
> assume the string represents a utf-16 encoding, a utf-8 encoding,
> etc...?

None of the above. It assumes nothing. It takes a string of characters, 
end of story.

> For reference the string is これは in the 'shift-jis' encoding.

No, it is not. The way to get a Unicode literal containing これは is to 
type the characters themselves in a Unicode-aware editor or terminal:

>>> s = u'これは'
>>> for c in s:
...     print ord(c), hex(ord(c)), c
... 
12371 0x3053 こ
12428 0x308c れ
12399 0x306f は
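
If your editor or terminal cannot enter the characters directly, \u escapes 
give the same result. This is a small sketch assuming Python 3 (in Python 2 
the literals would need the u prefix); the names typed and escaped are 
only for illustration.

```python
import unicodedata

# The same three characters, written two ways.
typed = 'これは'
escaped = '\u3053\u308c\u306f'

# Both literals produce identical text.
same = (typed == escaped)

# Look up the official Unicode name of each character.
names = [unicodedata.name(c) for c in escaped]
for c, name in zip(escaped, names):
    print(hex(ord(c)), name)
```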

You are confusing characters with bytes. I believe that what you are 
thinking of is the following: you start with a byte string, and then 
decode it into unicode:

>>> bytes = '\x82\xb1\x82\xea\x82\xcd'  # not u'...'
>>> text = bytes.decode('shift-jis')
>>> print text
これは
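
If you are already stuck with text whose code points are really byte 
values, as in the question, one way out is sketched below. It assumes 
Python 3 and relies on Latin-1 mapping code points 0-255 to the identical 
byte values; the variable names are hypothetical, and I spell the codec 
'shift_jis', which names the same codec as 'shift-jis'.

```python
# The six code points from the question, held as text by mistake.
mistaken = '\x82\xb1\x82\xea\x82\xcd'

# Latin-1 turns code point N (0-255) into the single byte N, so this
# recovers the bytes that were originally intended.
raw = mistaken.encode('latin-1')

# Now decode those bytes with the encoding they were really written in.
text = raw.decode('shift_jis')
print(text)
```

This only works when every character in the mistaken text is below 
U+0100; anything else makes the Latin-1 step raise UnicodeEncodeError.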

If you get the encoding wrong, you will get the wrong characters:

>>> print bytes.decode('utf-16')
놂춂
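
A wrong encoding does not always fail quietly, and it does not always fail 
loudly either. The sketch below assumes Python 3 and uses a hypothetical 
name, data, for the byte string: UTF-8 rejects these bytes outright, while 
Latin-1 accepts any byte sequence and returns the wrong characters without 
complaint.

```python
data = b'\x82\xb1\x82\xea\x82\xcd'

# 0x82 cannot start a UTF-8 sequence, so strict decoding raises.
try:
    data.decode('utf-8')
    utf8_failed = False
except UnicodeDecodeError as exc:
    utf8_failed = True
    print('not valid UTF-8:', exc.reason)

# errors='replace' substitutes U+FFFD for the undecodable bytes.
lenient = data.decode('utf-8', errors='replace')

# Latin-1 never fails: every byte maps to the code point of the same value.
latin1 = data.decode('latin-1')
```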

If you start with the Unicode characters, you can encode them into 
different byte strings, depending on the encoding you choose:

>>> s = u'これは'
>>> s.encode('shift-jis')
'\x82\xb1\x82\xea\x82\xcd'
>>> s.encode('utf-8')
'\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf'
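
To see that the same text gives different bytes under different encodings, 
and that decoding with the matching encoding gets the text back, here is a 
short round-trip sketch. It assumes Python 3; utf-16-le is my addition for 
comparison and is not one of the encodings shown above.

```python
s = '\u3053\u308c\u306f'  # これは

# Encode the same three characters with three different encodings.
encoded = {name: s.encode(name)
           for name in ('shift_jis', 'utf-8', 'utf-16-le')}

for name, raw in encoded.items():
    print(name, len(raw), raw)
    # Decoding with the encoding that produced the bytes restores the text.
    assert raw.decode(name) == s
```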

-- 
Steven