Unicode from Web to MySQL

Bill Eldridge bill at rfa.org
Sat Dec 20 11:21:58 EST 2003


I'm trying to grab a document off the Web and toss it
into a MySQL database, but I keep running into the
various encoding problems with Unicode (problems I don't
have with GB2312, Big5, etc.).

What I'd like is something as simple as:

CREATE TABLE junk (junklet VARCHAR(2500) CHARACTER SET utf8);

import MySQLdb, re, urllib

data = urllib.urlopen('http://localhost/test.html').read()

data2 = ???
...
c.execute(''' INSERT INTO junk ( junklet) VALUES ( '%s') ''' % data2 )

where data2 is somehow the UTF-8 converted version of the original Web page.
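
A minimal sketch of what the data2 step might look like, assuming the
page really is served as UTF-8 and that the driver accepts a query plus
a separate parameter tuple (the DB-API style, with MySQLdb's %s
placeholders). The names to_text, insert_junklet and RecordingCursor are
hypothetical; RecordingCursor only stands in for a real cursor so the
example runs without a database or a network connection.

```python
def to_text(raw, charset="utf-8", errors="strict"):
    """Decode the raw bytes of a fetched page into a text (unicode) string."""
    return raw.decode(charset, errors)


def insert_junklet(cursor, text):
    """Insert text into junk.junklet, letting the driver quote the value."""
    # The value is passed as a parameter instead of being pasted into the
    # SQL with the % operator, so quotation marks need no manual escaping.
    cursor.execute("INSERT INTO junk (junklet) VALUES (%s)", (text,))


class RecordingCursor(object):
    """Hypothetical stand-in for a database cursor; it only records calls."""

    def __init__(self):
        self.calls = []

    def execute(self, query, params=None):
        self.calls.append((query, params))


# Bytes such as read() might return for a UTF-8 page (made-up sample).
raw = u'<p>caf\u00e9 "quoted"</p>'.encode("utf-8")
data2 = to_text(raw)

c = RecordingCursor()
insert_junklet(c, data2)
```

With a real connection, c would come from the connection's cursor()
method, and the connection itself would probably also need to be told to
use UTF-8; how to do that depends on the MySQLdb and MySQL versions, so
it is left out here.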

Additionally, I'd like to be able to do:

body_expr  = re.compile('''<!-- MAIN -->(.*)<!-- /MAIN -->''')

data = urllib.urlopen('http://localhost/test.html').read()

main_body = body_expr.search(data).group(1)

and insert that into the database, and most likely I need to

I'm sitting with a dozen explanations from the Web on
how to do this:
0) Use decode('utf-8', 'ignore'), or 'strict', or 'replace'...
1) Use re.compile('''(?u)<!-- MAIN -->(.*)<!-- /MAIN -->''',
      re.UNICODE+re.IGNORECASE+re.MULTILINE+re.DOTALL)
2) Convert to unicode before encoding as UTF-8
3) Escape quotation marks within the SQL statement:
      data2.replace(u'"', u'\\"')

etc., etc., but after numerous tries I still keep getting either
SQL errors or the dreaded "'ascii' codec can't decode byte ... in
position ..." errors.
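
For reference, a small sketch of where that error can come from and of
the decode-first approach to the MAIN extraction. It assumes the page is
UTF-8; the sample bytes are made up. The error is triggered explicitly
here with decode('ascii'), which is the same codec that gets applied
when a byte string with non-ASCII bytes is converted implicitly.

```python
import re

# UTF-8 bytes for a page fragment with non-ASCII text (made-up sample).
raw = u"<!-- MAIN -->\n\u4e2d\u6587 text\n<!-- /MAIN -->".encode("utf-8")

# Decoding with the ASCII codec fails on the first byte above 0x7F.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decode once with the page's actual charset, then work only with text.
text = raw.decode("utf-8")

# re.DOTALL lets (.*) span the line breaks between the two marker comments.
body_expr = re.compile(u"<!-- MAIN -->(.*)<!-- /MAIN -->", re.DOTALL)
match = body_expr.search(text)
main_body = match.group(1) if match else None
```

The idea is to decode exactly once, right after reading, and to keep
byte strings and unicode strings from being mixed after that point.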

Can someone give me any explanation of how to do this easily?

Thanks,
Bill








