[2.5.1] ShiftJIS to Unicode?

Mark Tolonen metolone+gmane at gmail.com
Thu Nov 27 12:59:00 EST 2008


"Gilles Ganault" <nospam at nospam.com> wrote in message 
news:p02ti4pmotimkj0njbjt4e8qe0ejr3t7k9 at 4ax.com...
> On Thu, 27 Nov 2008 01:00:28 +0000, MRAB <google at mrabarnett.plus.com>
> wrote:
>>No problem here:
>>
>> >>> import urllib
>> >>> data = urllib.urlopen("http://www.amazon.co.jp/").read()
>> >>> decoded_data = data.decode("shift-jis")
>> >>>

This is correct.  You should read in the whole page and convert it to 
Unicode immediately.
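
As a minimal sketch of decoding at the point of entry (written for a current Python 3 interpreter rather than 2.5.1, where the raw data would be a plain str): the helper name decode_page and its fallback list are illustrative assumptions, not something from this thread. It relies on the fact that a failed decode raises UnicodeDecodeError, and that cp932, Microsoft's superset of Shift-JIS, accepts some vendor characters that the strict shift_jis codec rejects.

```python
# -*- coding: utf-8 -*-

def decode_page(raw, encodings=("shift_jis", "cp932")):
    """Decode raw page bytes once, as soon as they enter the program.

    Hypothetical helper: tries each candidate encoding in order and
    returns (text, encoding_used).
    """
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            # This candidate does not fit the bytes; try the next one.
            continue
    # Last resort: keep what can be decoded and mark the rest with U+FFFD.
    return raw.decode(encodings[0], "replace"), encodings[0] + " (lossy)"


# Stand-ins for urllib.urlopen(...).read(), so the sketch needs no network.
plain = u"日本語".encode("shift_jis")
vendor = u"①".encode("cp932")   # a cp932 extension character

text, used = decode_page(plain)
vendor_text, vendor_used = decode_page(vendor)
```

Everything after this call works with Unicode text; only the code that reads from or writes to the outside world handles bytes.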

>
> Thanks, but it seems like some pages contain ShiftJIS mixed with some
> other code page, and Python complains when trying to display this. I
> ended up not displaying the string, and just sending it directly to
> the database:
>
> ========
> title = None
> m = firsttry.search(the_page)
> if m:
>     try:
>         title = m.group(1).decode('shift-jis').strip()

You should not search the raw bytes and decode the matches afterwards. 
Decode the data as soon as it is brought into the program and do all 
processing in Unicode.
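
A rough sketch of that order of operations, again in Python 3 syntax: the sample page and the title pattern are made up for illustration and stand in for the_page and firsttry from the quoted code. It also shows one reason byte-level searching is fragile: in Shift-JIS the second byte of some characters (for example 表 and ソ) is 0x5C, the ASCII backslash, so a byte pattern can match in the middle of a character.

```python
# -*- coding: utf-8 -*-
import re

# Made-up sample standing in for the bytes read from the site.
raw_page = u"<html><title>表計算ソフト入門</title></html>".encode("shift_jis")

# Byte-level hazard: the trail byte of some Shift-JIS characters is 0x5C.
has_backslash_byte = b"\x5c" in raw_page

# Decode first, then search the Unicode text.
page = raw_page.decode("shift_jis")
title_re = re.compile(u"<title>(.*?)</title>", re.S)  # illustrative pattern

m = title_re.search(page)
title = m.group(1).strip() if m else None
```

With this arrangement the match is already Unicode, so no decode call or try/except is needed around m.group(1).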

>     except UnicodeEncodeError:
>         title = m.group(1).decode('iso8859-1').strip()
>     except:
>         title = ""
> else:
>     m = secondtry.search(the_page)
>     if m:
>         try:
>             title = m.group(1).decode('shift-jis').strip()
>         except UnicodeEncodeError:
>             title = m.group(1).decode('iso8859-1').strip()
>         except:
>             title = ""
>     else:
>         print "Nothing found for ISBN %s" % isbn
>
> if title:
>     #UnicodeEncodeError: 'charmap' codec can't encode characters in
>     # position 49-55: character maps to <undefined>
>     #print "Found : %s" % title
>     print "Found stuff"

Note that you are getting an "encode" error here, not a "decode" error.  
To print the data, Python has to encode the Unicode string using the 
terminal's default encoding, which I suspect is not Shift-JIS.
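
The failure can be reproduced without a terminal. In this sketch (Python 3 syntax) cp437 is only an assumed example of a console code page with no Japanese characters, and to_console_bytes is a hypothetical helper, not part of any library. The actual encoding depends on the terminal and can be inspected via sys.stdout.encoding.

```python
# -*- coding: utf-8 -*-
import sys

def to_console_bytes(text, encoding=None):
    """Encode text for display, replacing characters the target encoding lacks.

    Hypothetical helper; falls back to the stdout encoding, then ASCII.
    """
    enc = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(enc, "replace")


title = u"日本語のタイトル"

# Strict encoding to a code page without Japanese characters fails,
# which is what happens implicitly inside print.
try:
    title.encode("cp437")
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = str(exc)

# Encoding with "replace" never raises; unencodable characters become "?".
printable = to_console_bytes(title, "cp437")
```

The Unicode string itself is intact in both cases, which is why passing it to the database works even when printing it fails.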

-Mark

>
> sql = 'INSERT INTO books (title) VALUES (?)'
> cursor.execute(sql,(title,))
> ========
>
> Thank you




