unicode

Sun Jul 1 15:40:59 EDT 2007

Erik Max Francis wrote:
> 7stud wrote:
>
> > Based on this example and the error:
> >
> > -----
> > u_str = u"abc\u9999"
> > print u_str
> >
> > UnicodeEncodeError: 'ascii' codec can't encode character u'\u9999' in
> > position 3: ordinal not in range(128)
> > ------
> >
> > it looks like when I try to display the string, the ascii decoder
> > parses each character in the string and fails when it can't convert a
> > numerical code that is higher than 127 to a character, i.e. the
> > character \u9999.
>
> If you try to print a Unicode string, then Python will attempt to first
> encode it using the default encoding for that file.  Here, it's apparent
> the default encoding is 'ascii', so it attempts to encode it into ASCII,
> which it can't do, hence the exception.  The error is no different from
> this:
>
>  >>> u_str = u'abc\u9999'
>  >>> u_str.encode('ascii')
> Traceback (most recent call last):
>    File "<stdin>", line 1, in ?
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u9999' in
> position 3: ordinal not in range(128)
>
> > In the following example, I use encode() to convert a unicode string
> > to a regular string:
> >
> > -----
> > u_str = u"abc\u9999"
> > reg_str = u_str.encode("utf-8")
> > print repr(reg_str)
> > -----
> >
> > and the output is:
> >
> > 'abc\xe9\xa6\x99'
> >
> > 1) Why aren't the characters 'a', 'b', and 'c' in hex notation?  It
> > looks like python must be using the ascii decoder to parse the
> > characters in the string again--with the result being python converts
> > only the 1 byte numerical codes to characters. 2) Why didn't that
> > cause an error like above for the 3 byte character?
>
> Since you've already encoded the Unicode object as a normal string,
> Python isn't trying to do any implicit encoding.  As for why 'abc'
> appears in plain text, that's just the way repr works:
>
>  >>> s = 'a'
>  >>> print repr(s)
> 'a'
>  >>> t = '\x99'
>  >>> print repr(t)
> '\x99'
>
> repr is attempting to show the string in the most readable fashion.  If
> the character is printable, then it just shows it as itself.  If it's
> unprintable, then it shows it in hex string escape notation.
>
> > Then if I try this:
> >
> > ---
> > u_str = u"abc\u9999"
> > reg_str = u_str.encode("utf-8")
> > print reg_str
> > ---
> >
> > I get the output:
> >
> > abc<some chinese character>
> >
> > Here it looks like python isn't using the ascii decoder anymore.  3)
> > What determines which decoder python uses?
>
> Again, that's because by already encoding it as a string, Python isn't
> doing any implicit encoding.  So it prints the raw string, which happens
> to be UTF-8, and which your terminal obviously supports, so you see the
> proper character.
>
> --
> Erik Max Francis && max at alcyone.com && http://www.alcyone.com/max/
>   San Jose, CA, USA && 37 20 N 121 53 W && AIM, Y!M erikmaxfrancis
>    Let us not seek the Republican answer or the Democratic answer but
>     the right answer. -- John F. Kennedy
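
To make Erik's point concrete: print hands the text to the output file,
and that file's encoding decides whether the character can be written.
The sketch below is for Python 3 rather than the Python 2 used in this
thread (an assumption, made so it runs on a current interpreter). It uses
in-memory streams from the standard io module as stand-ins for a
terminal; the names ascii_out and utf8_out are made up for illustration.

```python
import io

text = "abc\u9999"  # Python 3 str, the counterpart of u"abc\u9999" above

# A text stream that can only write ASCII, standing in for an ASCII terminal.
ascii_out = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
try:
    print(text, end="", file=ascii_out)
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    print("ascii stream:", exc)

# The same print call succeeds when the stream's encoding is UTF-8.
raw = io.BytesIO()
utf8_out = io.TextIOWrapper(raw, encoding="utf-8")
print(text, end="", file=utf8_out)
utf8_out.flush()
print("utf-8 stream:", repr(raw.getvalue()))
```

Nothing about the string changes between the two calls; only the
encoding attached to the destination does.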

So let me see if I have this right:

Here is some code:
-----
print "print unicode string:"
#print u"abc\u9999"   #error
print repr(u'abc\u9999')
print

print "print regular string containing chars in unicode syntax:"
print 'abc\u9999'
print repr('abc\u9999')
print

print "print regular string containing chars in utf-8 syntax:"
#encode() converts unicode strings to regular strings
print u'abc\u9999'.encode("utf-8")
print repr(u'abc\u9999'.encode("utf-8"))
-----

Here is the output:
-------
print unicode string:
u'abc\u9999'

print regular string containing chars in unicode syntax:
abc\u9999
'abc\\u9999'

print regular string containing chars in utf-8 syntax:
abc<chinese character>
'abc\xe9\xa6\x99'
------
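
The first line of output, u'abc\u9999', comes from repr(), which as far
as I can tell writes an escaped form of the string instead of encoding it
with the default codec. A small sketch of the difference, assuming
Python 3, where ascii() plays roughly the role that repr() plays for
unicode strings in Python 2 and str literals need no u prefix:

```python
text = "abc\u9999"

# Escaped rendering: non-ASCII characters become \u escapes, nothing is skipped.
escaped = ascii(text)
print(escaped)  # 'abc\u9999'

# Strict encoding to ASCII, by contrast, refuses the character outright.
try:
    text.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)  # False

# Error handlers show what "skipping" or "escaping" would look like if asked for.
print(text.encode("ascii", "ignore"))            # b'abc'
print(text.encode("ascii", "backslashreplace"))  # b'abc\\u9999'
```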

1) If you print a unicode string:

*print implicitly calls str()*

a) str() calls encode(), and encode() tries to convert the unicode
string to a regular string.  encode() uses the default encoding, which
is ascii.  If encode() can't convert a character, then encode() raises
a UnicodeEncodeError.

b) repr() calls encode(), but if encode() raises an exception for a
character, repr() catches the exception and skips over the character
leaving the character unchanged.

2) If you print a regular string containing characters in unicode
syntax:

a) str() calls encode(), but if encode() raises an exception for a
character, str() catches the exception and skips over the character
leaving the character unchanged.  Same as 1b.

b) repr() is similar to a), but repr() then escapes the escapes in the
string.

3) If you print a regular string containing characters in utf-8
syntax:

a) str() outputs the string to your terminal, and if your terminal can
convert the utf-8 numerical codes to characters it does so.

b) repr() blocks your terminal from interpreting the characters by
escaping the escapes in your string.  Why don't I see two slashes like
in the output for 2b?
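
One way to probe that last question is to count what each string actually
contains. The sketch below assumes Python 3, so a raw str literal stands
in for the Python 2 regular string 'abc\u9999' and a bytes object stands
in for the UTF-8 encoded regular string; the variable names are
illustrative.

```python
literal = r"abc\u9999"               # backslash, 'u' and four digits kept as-is
utf8 = "abc\u9999".encode("utf-8")   # three ASCII bytes plus three UTF-8 bytes

print(len(literal), "\\" in literal)  # 9 True
print(len(utf8), b"\\" in utf8)       # 6 False
print(list(utf8))                     # [97, 98, 99, 233, 166, 153]

# repr() doubles a backslash that is really in the string...
print(repr(literal))  # 'abc\\u9999'
# ...and writes \xNN for bytes that are not printable ASCII.
print(repr(utf8))     # b'abc\xe9\xa6\x99'
```

If that reading is right, the \x in 3b is notation that repr() generates
for each non-printable byte, not an existing escape being escaped, so
there is no backslash in the data to double.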