A question about unicode() function

Mon Jan 1 02:57:33 EST 2007

JTree wrote:
> Hi,
>
> I changed my codes to:
>
> #!/usr/bin/python
> #Filename: test.py
> #Modified: 2007-01-01
>
> import cPickle as p
> import urllib
> import htmllib
> import re
> import sys
>
> funUrlFetch =  lambda url:urllib.urlopen(url).read()
>
> objUrl = raw_input('Enter the Url:')
> content = funUrlFetch(objUrl)
> content = content.encode('gb2312','ignore')

Why did you change what you had before? "content" is a str, encoded in
gb2312 (according to the internal evidence). You are now pretending
that it is unicode, and trying to encode it as gb2312. However, because
it is *not* unicode, Python tries to convert it to unicode first, using
the default 'ascii' codec. What you have coded above is equivalent to:
    content = content.decode('ascii').encode('gb2312', 'ignore')

and of course the *decode* fails, as the error message says:
Unicode*Decode*Error: 'ascii' codec can't decode byte 0xbb in position
88: ordinal not in range(128)

It never got anywhere near the encode().
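
A minimal sketch of that failing step, assuming a two-character sample
string stands in for the real page bytes (the sample data and the helper
name ascii_decode_fails are hypothetical, chosen only for illustration):

```python
# Stand-in for the fetched page: two Chinese characters encoded as gb2312.
# (Hypothetical sample data, not the real page content.)
content = u'\u4e2d\u6587'.encode('gb2312')

def ascii_decode_fails(data):
    """Return True if decoding the bytes as ASCII raises UnicodeDecodeError."""
    try:
        data.decode('ascii')
    except UnicodeDecodeError:
        return True
    return False

# This is the hidden first half of content.encode('gb2312', 'ignore')
# on a Python 2 str; the encode half is never reached.
failed = ascii_decode_fails(content)
print(failed)
```

The 'ignore' argument belongs to the encode step, so it has no effect on
the decode that actually raises.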

So:
If you want a str encoded in gb2312, leave it alone.
If you want it in unicode, do this:
    ucontent = unicode(content, 'gb2312')
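
As a rough illustration of the two directions, assuming the same kind of
hypothetical sample bytes as the page would contain (on Python 2,
content.decode('gb2312') gives the same result as
unicode(content, 'gb2312')):

```python
# Hypothetical sample bytes standing in for the fetched page.
content = u'\u4e2d\u6587'.encode('gb2312')

# bytes/str -> unicode: decode with the encoding the data is really in.
ucontent = content.decode('gb2312')

# unicode -> bytes/str: encode. Going back to gb2312 loses nothing.
roundtrip = ucontent.encode('gb2312')

# An error handler such as 'ignore' only matters at this encode step,
# when the target codec cannot represent some characters.
ascii_only = (ucontent + u' ok').encode('ascii', 'ignore')
```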

> print content

Try print repr(content)
It's much better for diagnostic purposes.
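
A small sketch of why repr() helps, again with hypothetical sample bytes:
it shows every non-ASCII byte as an escape, so what you see does not
depend on the console's encoding.

```python
# Hypothetical sample bytes standing in for the fetched page.
content = u'\u4e2d\u6587'.encode('gb2312')

# repr() spells out each non-ASCII byte as a \xNN escape.
shown = repr(content)
print(shown)
```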

> content.close()

This will be your next problem; "content" refers to a str object or a
unicode object, and neither has a close() method! It is the file-like
object returned by urllib.urlopen() that can be closed.
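
A sketch of the distinction, using io.BytesIO as a stand-in for the
file-like object that urllib.urlopen() returns (an assumption made so the
example runs without network access; the name "response" is hypothetical):

```python
import io

# Stand-in for urllib.urlopen(url): a file-like object holding page bytes.
response = io.BytesIO(u'\u4e2d\u6587'.encode('gb2312'))

content = response.read()   # the data that was read: a plain string of bytes
response.close()            # close the file-like object, not the data

# The data itself has no close() method.
has_close = hasattr(content, 'close')
print(has_close)
```

Keeping a name for the opened object, instead of chaining .read() onto
urlopen() inside a lambda, is what makes the close() call possible.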

>
> I used "ignore" to deal with the data loss, but it still caused an
> error:

What data loss???

>
> C:\WINDOWS\system32\cmd.exe /c python tianya.py
> Enter the Url:http://www.tianya.cn
> Traceback (most recent call last):
>   File "tianya.py", line 17, in ?
>     content = content.encode('gb2312','ignore')
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xbb in position
> 88: ordinal not in range(128)
> shell returned 1
> Hit any key to close this window...
>
> My python version is 2.4, Does it have some problems with asian
> encoding support?

"asian" is irrelevant. You would have got the same problem with just
about any non-ascii encoding, including cp1252 and similar encodings
commonly used in English-speaking countries and in western Europe. The
only encoding support problem with 2.4 is that it can't read your mind.
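
To illustrate, here is a sketch with one hypothetical western European
sample and one Chinese sample; the implicit ASCII decode rejects both in
the same way (the helper name ascii_decode_fails is made up for this
example):

```python
def ascii_decode_fails(data):
    """Return True if decoding the bytes as ASCII raises UnicodeDecodeError."""
    try:
        data.decode('ascii')
    except UnicodeDecodeError:
        return True
    return False

# Hypothetical sample data: "cafe" with an accented e, and two Chinese characters.
western = u'caf\xe9'.encode('cp1252')
chinese = u'\u4e2d\u6587'.encode('gb2312')

results = [ascii_decode_fails(western), ascii_decode_fails(chinese)]
print(results)
```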

By the way, you should upgrade to 2.5; it can't read your mind either,
but it has more functionality, etc. :-)

HTH,
John