[2.5.1] ShiftJIS to Unicode?

Gilles Ganault nospam at nospam.com
Thu Nov 27 06:46:21 EST 2008


On Thu, 27 Nov 2008 01:00:28 +0000, MRAB <google at mrabarnett.plus.com>
wrote:
>No problem here:
>
> >>> import urllib
> >>> data = urllib.urlopen("http://www.amazon.co.jp/").read()
> >>> decoded_data = data.decode("shift-jis")
> >>>

Thanks, but it seems that some pages contain ShiftJIS mixed with some
other code page, and Python raises a UnicodeEncodeError when I try to
print the decoded string. I ended up not displaying the string, and
just sending it directly to the database:

========
title = None
m = firsttry.search(the_page)
if m:
	try:
		title = m.group(1).decode('shift-jis').strip()
	except UnicodeDecodeError:
		# decode() raises UnicodeDecodeError, not UnicodeEncodeError
		title = m.group(1).decode('iso8859-1').strip()
	except:
		title = ""
else:
	m = secondtry.search(the_page)
	if m:
		try:
			title = m.group(1).decode('shift-jis').strip()
		except UnicodeDecodeError:
			# same fallback as above
			title = m.group(1).decode('iso8859-1').strip()
		except:
			title = ""
	else:
		print "Nothing found for ISBN %s" % isbn

if title:
	#UnicodeEncodeError: 'charmap' codec can't encode characters in
	#position 49-55: character maps to <undefined>
	#print "Found : %s" % title
	print "Found stuff"

sql = 'INSERT INTO books (title) VALUES (?)'
cursor.execute(sql,(title,))
========
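
The two try/except blocks above do the same thing, so they could
probably be folded into one helper, and the print problem could be
worked around by replacing whatever the console cannot show. Below is
a minimal sketch of both ideas, not a tested part of the script:
decode_title() and printable() are hypothetical names, and the
fallback to '?' assumes that losing a few characters on screen is
acceptable as long as the database still gets the full string.

```python
import sys

# Hypothetical helpers, not part of the original script.

def decode_title(raw, encodings=("shift-jis", "iso8859-1")):
    """Return raw bytes decoded with the first encoding that accepts them."""
    for encoding in encodings:
        try:
            return raw.decode(encoding).strip()
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; try the next one.
            continue
    # Reached only if every candidate failed. iso8859-1 accepts any byte,
    # so with the default list this is just a safety net.
    return raw.decode(encodings[0], "replace").strip()


def printable(text, encoding=None):
    """Return text with characters the output encoding lacks replaced by '?'."""
    # Assumption: fall back to ASCII when the stream reports no encoding.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, "replace").decode(encoding)
```

With these, the body of each branch would shrink to
title = decode_title(m.group(1)), and the commented-out line could
become print "Found : %s" % printable(title), while the undamaged
title still goes to cursor.execute().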

Thank you
