unicode

Sun Jul 1 16:14:18 EDT 2007

> 1) If you print a unicode string:
> 
> *print implicitly calls str()*

No. print performs no conversion if the object is already a string or
unicode object; it calls str() only otherwise.

> a) str() calls encode(), and encode() tries to convert the unicode
> string to a regular string.  encode() uses the default encoding, which
> is ascii.  If encode() can't convert a character, then encode() raises
> an exception.

Yes and no. This is what str() does, but str() isn't called. Instead,
print inspects sys.stdout.encoding, and uses that encoding to encode
the string. That, in turn, may raise an exception (in particular if
sys.stdout.encoding is "ascii" or not set).
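
The encoding step can be sketched roughly as follows. This is only an
illustration, not print's actual implementation: encode_for_stream is a
hypothetical helper, and the fallback to "ascii" when no encoding is set
is an assumption modeled on the default encoding mentioned above.

```python
def encode_for_stream(text, encoding):
    """Hypothetical helper: encode text for a stream that reports the
    given encoding, falling back to ASCII if none is set (assumption)."""
    return text.encode(encoding or "ascii")

text = u"\u9999"  # a single non-ASCII character

# A stream that reports UTF-8 can represent the character.
utf8_bytes = encode_for_stream(text, "utf-8")

# A stream with no encoding set (treated as ASCII here) cannot.
try:
    encode_for_stream(text, None)
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

With a UTF-8 stream the call succeeds; with no encoding set, the same
call raises UnicodeEncodeError, which is the exception seen when
printing.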

> b) repr() calls encode(), but if encode() raises an exception for a
> character, repr() catches the exception and skips over the character
> leaving the character unchanged.

No. repr() never calls encode(). Instead, each type, including unicode,
may have its own __repr__, which is what gets called. unicode.__repr__
escapes all non-ASCII characters.
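
To see this kind of escaping without a Python 2 interpreter, Python 3's
built-in ascii() can serve as a stand-in; its documentation describes it
as producing a string similar to what repr() returned in Python 2. A
small sketch, assuming Python 3 (the sample text is made up):

```python
# Python 3 sketch: ascii() escapes every non-ASCII character, much as
# repr() of a unicode object did in Python 2.
text = "caf\u00e9 \u9999"
escaped = ascii(text)
print(escaped)  # 'caf\xe9 \u9999'
```

No encode() call is involved, and no exception can occur: the result
consists of ASCII characters only.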

> 2) If you print a regular string containing characters in unicode
> syntax:

No. There is no such thing:

py> len("\u")
2
py> "\u"[0]
'\\'
py> "\u"[1]
'u'

In a regular string, \u has no meaning, so \ stands just for itself.

> a) str() calls encode(), but if encode() raises an exception for a
> character, str() catches the exception and skips over the character
> leaving the character unchanged.  Same as 1b.

No. Printing a string never invokes .encode(), and no exception occurs
at all. Instead, the \ just gets printed as is.

> b) repr() similar to a), but repr() then escapes the escapes in the
> string.

str.__repr__ escapes the backslash just in case, so that it won't have
to check for the next character; in that sense, it generates a normal
form.
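
A short sketch of that normal form. The backslash is spelled explicitly
as "\\" here so that the snippet also runs on Python 3, where a bare
"\u" in a string literal is a syntax error:

```python
s = "\\u"        # two characters: a backslash followed by 'u'
print(len(s))    # 2
print(s)         # \u
print(repr(s))   # '\\u'
```

The doubled backslash appears only in the repr() output; the string
itself still contains a single backslash.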

> 3) If you print a regular string containing characters in utf-8
> syntax:
> 
> a) str() outputs the string to your terminal, and if your terminal can
> convert the utf-8 numerical codes to characters it does so.

Correct. In general, you should always use the terminal's encoding
when printing to the terminal. That way, you can print everything
that the terminal can display, and you get an exception if you try
to print something that the terminal would be unable to display.

> b) repr() blocks your terminal from interpreting the characters by
> escaping the escapes in your string.  Why don't I see two slashes like
> in the output for 2b?

str.__repr__ produces output that is legal Python syntax for a string
literal. len(u'\u9999'.encode('utf-8')) is 3, so this Chinese character
really encodes as three separate bytes. As these are non-ASCII bytes,
__repr__ chooses a representation that is legal Python syntax. For that
character, it emits the hex escapes \xe9, \xa6 and \x99 (each
representing a single byte). There is no backslash in the string
itself, so there is nothing to double. For a backslash, Python could
have generated \x5c or \134 as well; these are all different spellings
of "backslash in a string literal". Python chose the most legible
one, which is the double backslash.
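
Both points can be checked directly. A minimal sketch, written so that
it runs on Python 2 and Python 3 alike:

```python
encoded = u"\u9999".encode("utf-8")
print(len(encoded))    # 3
print(repr(encoded))   # '\xe9\xa6\x99' (with a b prefix on Python 3)

# Three spellings of the same one-character string: a backslash.
print("\x5c" == "\134" == "\\")   # True
```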

HTH,
Martin