unicode

Sun Jul 1 01:26:20 EDT 2007

Based on this example and the error:

-----
u_str = u"abc\u9999"
print u_str

UnicodeEncodeError: 'ascii' codec can't encode character u'\u9999' in
position 3: ordinal not in range(128)
-----

it looks like when I try to display the string, the ascii codec
processes each character in the string and fails when it reaches a
numerical code higher than 127 that it can't encode as a byte, i.e.
the character \u9999.
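
A minimal sketch of that guess, assuming print falls back to the
ascii codec when it doesn't know the output encoding: encoding the
string to ASCII by hand raises the same kind of error at the same
position. The name failed_at is only for illustration.

```python
u_str = u"abc\u9999"

try:
    # Presumably what print attempts when no output encoding is known.
    u_str.encode("ascii")
    failed_at = None
except UnicodeEncodeError as err:
    # err.start is the index of the first character that could not be encoded.
    failed_at = err.start

print(failed_at)  # 3
```

The index 3 matches "position 3" in the error message above.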

In the following example, I use encode() to convert a unicode string
to a regular string:

-----
u_str = u"abc\u9999"
reg_str = u_str.encode("utf-8")
print repr(reg_str)
-----

and the output is:

'abc\xe9\xa6\x99'

1) Why aren't the characters 'a', 'b', and 'c' in hex notation?  It
looks like Python must be using the ascii codec to process the
characters in the string again, with the result that Python converts
only the 1-byte numerical codes to characters.

2) Why didn't that cause an error like the one above for the 3-byte
character?
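
One way to look at the first two questions, as far as I can tell:
the encoded string is just six bytes, and repr() appears to show
printable ASCII bytes as characters and everything else as \xNN
escapes, without decoding anything (which would explain the lack of
an error). A small sketch that lists the raw byte values; the names
byte_values and hex_values are only for illustration.

```python
u_str = u"abc\u9999"
reg_str = u_str.encode("utf-8")

# bytearray() yields integer byte values in both Python 2 and Python 3.
byte_values = list(bytearray(reg_str))
print(byte_values)  # [97, 98, 99, 233, 166, 153]

# The same bytes in hex: 61, 62, 63 are 'a', 'b', 'c'; e9 a6 99 is \u9999 in UTF-8.
hex_values = ["%02x" % b for b in byte_values]
print(hex_values)  # ['61', '62', '63', 'e9', 'a6', '99']
```

So 'a', 'b', and 'c' are stored as single bytes either way; repr()
just picks the more readable spelling for them.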

Then if I try this:

-----
u_str = u"abc\u9999"
reg_str = u_str.encode("utf-8")
print reg_str
-----

I get the output:

abc<some chinese character>

Here it looks like Python isn't using the ascii codec anymore.

3) What determines which codec Python uses?
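
A rough sketch of where I would start looking, under two
assumptions: that printing a unicode string uses the encoding
reported by sys.stdout.encoding (falling back to ascii when it is
None), and that printing an already encoded string just writes its
bytes, leaving the terminal to decode them. The names
stdout_encoding and round_trip are only for illustration.

```python
import sys

# Encoding Python associates with standard output; it may be None
# (or missing) when output is piped or redirected.
stdout_encoding = getattr(sys.stdout, "encoding", None)
print(stdout_encoding)

u_str = u"abc\u9999"
reg_str = u_str.encode("utf-8")

# Decoding the bytes as UTF-8, which a UTF-8 terminal is assumed to do
# when displaying them, gives back the original unicode string.
round_trip = reg_str.decode("utf-8")
print(round_trip == u_str)  # True
```

If the terminal expects a different encoding, the same six bytes
would presumably show up as different characters.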