Unicode literals and byte string interpretation.
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Fri Oct 28 03:06:08 EDT 2011
On Thu, 27 Oct 2011 20:05:13 -0700, Fletcher Johnson wrote:
> If I create a new Unicode object u'\x82\xb1\x82\xea\x82\xcd' how does
> this creation process interpret the bytes in the byte string?
It doesn't, because there is no byte string. You have created a Unicode
object from a literal string of Unicode characters, not bytes. Those
characters are:
Dec  Hex   Char
130  0x82  (unprintable control character)
177  0xb1  ±
130  0x82  (unprintable control character)
234  0xea  ê
130  0x82  (unprintable control character)
205  0xcd  Í
Don't be fooled by the fact that all of the characters happen to have
ordinals in the range 0-255; that is irrelevant.
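You can check this for yourself by listing the ordinals of the literal.
What follows is only a minimal sketch, and it assumes Python 3 rather than
the Python 2 used elsewhere in this message: there every string literal is
already Unicode, so the u prefix is optional.

```python
# Python 3 sketch: a str literal is a sequence of Unicode code points,
# not a sequence of bytes.
s = '\x82\xb1\x82\xea\x82\xcd'

# Each \xNN escape produces exactly one character with that ordinal.
code_points = [ord(c) for c in s]
print(code_points)

# Six characters, and the object is text, not bytes.
print(len(s), type(s).__name__)
```

No codec is consulted at any point here; the escapes name characters
directly.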
> Does it
> assume the string represents a utf-16 encoding, a utf-8 encoding,
> etc...?
None of the above. It assumes nothing. It takes a string of characters,
end of story.
> For reference the string is これは in the 'shift-jis' encoding.
No, it is not. The way to get a Unicode literal with those characters is
to use a Unicode-aware editor or terminal:
>>> s = u'これは'
>>> for c in s:
... print ord(c), hex(ord(c)), c
...
12371 0x3053 こ
12428 0x308c れ
12399 0x306f は
You are confusing characters with bytes. I believe that what you are
thinking of is the following: you start with a byte string, and then
decode it into unicode:
>>> bytes = '\x82\xb1\x82\xea\x82\xcd' # not u'...'
>>> text = bytes.decode('shift-jis')
>>> print text
これは
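If the six characters are already sitting in a Unicode object, as in the
original question, one possible way back is to turn them into bytes first
and only then decode. The sketch below assumes Python 3 syntax, and the
names mistaken, raw and text are mine, chosen for illustration. It relies
on the fact that the latin-1 codec maps ordinals 0-255 to the identical
byte values.

```python
# Python 3 sketch. 'mistaken' stands in for the u'...' literal from the
# question: six characters whose ordinals happen to equal the byte values.
mistaken = '\x82\xb1\x82\xea\x82\xcd'

# latin-1 maps each code point 0-255 to the byte with the same value,
# so this recovers the original six bytes unchanged.
raw = mistaken.encode('latin-1')

# Now there really is a byte string, and it can be decoded.
text = raw.decode('shift_jis')

# Six characters, six bytes, three characters after decoding.
print(len(mistaken), len(raw), len(text))
```

This only works because every character in the mistaken string is below
256; it is a repair for this particular mix-up, not a general technique.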
If you get the encoding wrong, you will get the wrong characters:
>>> print bytes.decode('utf-16')
놂춂
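Two details are worth spelling out. With no byte order mark, the plain
'utf-16' codec uses the machine's native byte order, so the output above
is what a little-endian machine gives. And some codecs do not hand back
wrong characters at all; they refuse the bytes. A small sketch, assuming
Python 3 syntax and an explicit 'utf-16-le' so the result does not depend
on the platform:

```python
# Python 3 sketch: the same six bytes under three different codecs.
data = b'\x82\xb1\x82\xea\x82\xcd'

right = data.decode('shift_jis')    # the encoding the bytes were written in
wrong = data.decode('utf-16-le')    # a wrong guess that still "succeeds"

# Show code points rather than glyphs, so the output is terminal-independent.
print([hex(ord(c)) for c in right])
print([hex(ord(c)) for c in wrong])

# A wrong guess can also fail outright: 0x82 cannot start a UTF-8 sequence.
try:
    data.decode('utf-8')
    utf8_failed = False
except UnicodeDecodeError as exc:
    utf8_failed = True
    print('utf-8 failed:', exc.reason)
```

The bytes never change; only the interpretation does.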
If you start with the Unicode characters, you can encode them into various
byte strings:
>>> s = u'これは'
>>> s.encode('shift-jis')
'\x82\xb1\x82\xea\x82\xcd'
>>> s.encode('utf-8')
'\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf'
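The same three characters therefore have different byte representations,
and each one decodes back to the same text as long as you use the matching
codec. A minimal round-trip sketch, again assuming Python 3 syntax (the
variable names are illustrative):

```python
# Python 3 sketch: one piece of text, two byte representations.
text = '\u3053\u308c\u306f'   # the three hiragana characters from this thread

sjis = text.encode('shift_jis')
utf8 = text.encode('utf-8')

# Shift-JIS needs two bytes per character here, UTF-8 needs three.
print(sjis, len(sjis))
print(utf8, len(utf8))

# Decoding with the matching codec returns the original text.
round_trip_ok = (sjis.decode('shift_jis') == text
                 and utf8.decode('utf-8') == text)
print(round_trip_ok)
```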
--
Steven