[Tutor] What is ™

Steven D'Aprano steve at pearwood.info
Fri Dec 23 00:30:25 CET 2011


bob gailer wrote:
>  >>> "™"
> '\xe2\x84\xa2'
> 
> What is this hex string?

Presumably you are running Python 2? I will assume so in the following 
explanation.

You have just run smack bang into a collision between text and bytes, and in 
particular, the fact that your console is probably Unicode aware, but Python 
2's so-called strings are by default bytes and not text.

When you enter "™", your console is more than happy to allow you to enter a 
Unicode trademark character[1] and put it in between " " delimiters. This 
creates a plain bytes string. But the ™ character is not a byte, and shouldn't 
be treated as one. Arguably Python should raise an error, but instead it 
quietly accepts whatever bytes your console sends for that character, which 
depends on the console's encoding. (Probably UTF-8.) The three hex bytes you 
actually get are the encoding of the TM character.
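
To see where those three bytes come from, here is a rough sketch of the UTF-8 
bit layout worked by hand for U+2122, the trademark sign. The variable names 
are my own, purely for illustration, and the snippet should run on Python 2 
or 3:

```python
# Hand-encode U+2122 (TRADE MARK SIGN) as UTF-8.
# Code points from U+0800 to U+FFFF use the three-byte pattern
#   1110xxxx 10xxxxxx 10xxxxxx
code_point = 0x2122

byte1 = 0xE0 | (code_point >> 12)          # top 4 bits
byte2 = 0x80 | ((code_point >> 6) & 0x3F)  # middle 6 bits
byte3 = 0x80 | (code_point & 0x3F)         # bottom 6 bits

hex_bytes = ['%02x' % b for b in (byte1, byte2, byte3)]
print(hex_bytes)  # ['e2', '84', 'a2']
```

Those are the same three values as in the '\xe2\x84\xa2' you saw.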

Python 2 does have proper text strings, but you have to write them as unicode 
strings, with a u prefix:

py> s = u"™"
py> len(s)
1
py> s
u'\u2122'
py> print s
™
py> s.encode('utf-8')
'\xe2\x84\xa2'

Notice that encoding the trademark character to UTF-8 gives the same sequence 
of bytes that ended up in your plain string, which supports my guess that 
UTF-8 is the encoding being used.
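
As a rough check on that guess, you can compare what a few other encodings 
would produce for the same character. This is only a sketch: the encodings 
below are ones I picked for illustration, not necessarily ones your system 
uses, and the snippet assumes Python 2.6+ or 3.3+:

```python
tm = u'\u2122'  # the trademark character, written as an escape

utf8 = tm.encode('utf-8')        # three bytes
cp1252 = tm.encode('cp1252')     # one byte in the Windows Western code page
utf16 = tm.encode('utf-16-le')   # two bytes, little-endian, no BOM

# Latin-1 has no trademark character, so encoding fails outright.
try:
    tm.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(repr(utf8), repr(cp1252), repr(utf16), latin1_ok)
```

Only UTF-8 gives the three bytes you saw. If you want to know what Python 
thinks your console is using, sys.stdin.encoding will usually tell you.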

If you take the three-byte string and decode it using UTF-8, you will get the 
trademark character back.
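
A minimal sketch of that round trip, with names of my own choosing, which 
should behave the same on Python 2.6+ and 3.3+:

```python
raw = b'\xe2\x84\xa2'        # the three bytes from the original question
text = raw.decode('utf-8')   # bytes -> text

print(len(raw))    # 3, because it is three bytes
print(len(text))   # 1, because it is a single character

# Encoding the text again gives back the original bytes.
round_trip = text.encode('utf-8')
```

The difference in len() is the quickest way to tell whether you are holding 
bytes or text.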

If all my talk of encodings doesn't mean anything to you, you should read this:

http://www.joelonsoftware.com/articles/Unicode.html




[1] Assuming your console is set to use the same encoding as my mail client is 
using. Otherwise I'm seeing something different to you.

-- 
Steven

