Unicode from Web to MySQL

Sat Dec 20 17:23:35 EST 2003

"Bill Eldridge" <bill at rfa.org> wrote in message news:mailman.375.1071937328.9307.python-list at python.org...
>
> I'm trying to grab a document off the Web and toss it
> into a MySQL database, but I keep running into the
> various encoding problems with Unicode (that aren't
> a problem for me with GB2312, BIG 5, etc.)
>
> What I'd like is something as simple as:
>
> CREATE TABLE junk (junklet VARCHAR(2500) CHARACTER SET UTF8));
>
> import MySQLdb, re,urllib
>
> data = urllib.urlopen('http://localhost/test.html').read()

You've got 8-bit data here. urllib doesn't currently handle encoding issues;
maybe submitting an SF feature request will change that. But right now you have
to do it yourself. Scan (or parse) the HTML header for the encoding; if it's absent,
grab the encoding from the HTTP header. If that's absent too, then the encoding,
AFAIR, is latin-1. So your code should look like:
connection = urllib.urlopen('http://localhost/test.html')
encoding = 'latin-1'
header_encoding = get_http_header_encoding(connection)
data = connection.read()
content_encoding = get_http_content_encoding(data)
if header_encoding:
    encoding = header_encoding
if content_encoding:
    encoding = content_encoding
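
The two helpers used above are not library functions. A minimal sketch of what
they might look like in present-day Python 3 follows; the bodies, the regex and
the fake response object are all assumptions made for illustration, and the
<meta> scan is deliberately loose rather than a real HTML parse.

```python
import re
from email.message import Message
from types import SimpleNamespace

# Both helper names come from the snippet above; the bodies are assumptions.

def get_http_header_encoding(connection):
    """Charset from the HTTP Content-Type header, or None if not given."""
    # urllib.request responses expose their headers as an email.message.Message.
    return connection.headers.get_content_charset()

# Deliberately loose pattern: finds charset=... inside a <meta ...> tag.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)

def get_http_content_encoding(data):
    """Charset declared in a <meta> tag within the first 4 KB, or None."""
    match = _META_CHARSET.search(data[:4096])
    return match.group(1).decode('ascii').lower() if match else None

# Stand-in for a real response object, so the sketch runs without a server.
headers = Message()
headers['Content-Type'] = 'text/html; charset=ISO-8859-1'
fake_connection = SimpleNamespace(headers=headers)
page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=gb2312"></head></html>')

header_encoding = get_http_header_encoding(fake_connection)
content_encoding = get_http_content_encoding(page)
```

With these in place, the if-chain above picks the <meta> value when both are
present, which matches the order described in the text.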

>
> data2 = ???

data2 = data.decode(encoding,'replace')
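
To see what the second argument does, here is a small sketch in Python 3
syntax. The sample bytes are made up for illustration: they are valid latin-1
but not valid UTF-8, which is the situation a wrong encoding guess produces.

```python
raw = b'caf\xe9'  # 'cafe' with e-acute in latin-1; not valid UTF-8

as_latin1 = raw.decode('latin-1')           # correct guess, nothing lost
replaced = raw.decode('utf-8', 'replace')   # bad byte becomes U+FFFD
ignored = raw.decode('utf-8', 'ignore')     # bad byte is silently dropped

try:
    raw.decode('utf-8', 'strict')           # 'strict' is the default handler
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

'replace' keeps a visible marker where data was lost, which is usually easier
to debug later than text that 'ignore' has quietly shortened.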

> ...
> c.execute(''' INSERT INTO junk ( junklet) VALUES ( '%s') ''' % data2 )

A quick scan of the mysql-python docs suggests that you should also
call connect with the unicode='utf-8' parameter. Have you done that?

>
> where data2 is somehow the UTF-8 converted version of the original Web page.
>
> Additionally, I'd like to be able to do:
>
> body_expr  = re.compile('''<!-- MAIN -->(.*)<!-- /MAIN -->''')
>
> data = urllib.urlopen('http://localhost/test.html').read()
>
> main_body = body_expr.search(data).group(1)

Don't do that to the data variable: data is an 8-bit string, which
contains bytes, not characters. Use data2 instead.

As a rule of thumb, you should decode to unicode as soon as you
can and leave the unicode world as late as you can. Also use
unicode-aware APIs when they are available; that way you won't
even need to encode unicode objects yourself.
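
The rule of thumb can be sketched as follows, in Python 3 syntax. The sample
page and the choice of gb2312 are assumptions for illustration; the pattern is
the one from the question, with re.DOTALL added so the body may span lines.

```python
import re

# Input boundary: bytes arrive from the network (faked here).
# '\u4e2d\u6587' is the two-character word for "Chinese (language)".
raw = '<html><!-- MAIN -->\u4e2d\u6587 text<!-- /MAIN --></html>'.encode('gb2312')

# Decode once, as early as possible.
data2 = raw.decode('gb2312', 'replace')

# All processing happens on characters, not bytes.
body_expr = re.compile('<!-- MAIN -->(.*)<!-- /MAIN -->', re.DOTALL)
main_body = body_expr.search(data2).group(1)

# Output boundary: encode only if the receiving API wants bytes.
utf8_bytes = main_body.encode('utf-8')
```

Here main_body has 7 characters, while the UTF-8 form has 11 bytes, which is
why slicing or matching on the undecoded bytes gives surprising results.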

>
> and insert that into the database, and most likely I need to
>
> I'm sitting with a dozen explanations from the Web explaining
> how to do this,
> 0) decode('utf-8','ignore') or 'strict', or 'replace'...
> 1) using re.compile('''(?u)<!-- MAIN>(.*)<!-- /MAIN -->'''),
>       re.UNICODE+re.IGNORECASE+re.MULTILINE+re.DOTALL)

You don't need re.UNICODE if you don't use \w, \W, \b, or \B (the flag also
affects \d, \D, \s, and \S). Your pattern uses none of them.
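
A small sketch of which patterns the flag actually touches. This assumes
present-day Python 3, where the default is reversed: str patterns are
Unicode-aware unless re.ASCII is given. The sample strings are made up.

```python
import re

text = 'na\u00efve caf\u00e9'  # "naive cafe" with i-diaeresis and e-acute

# \w depends on the flag: accented letters count as word characters
# only in Unicode mode.
unicode_words = re.findall(r'\w+', text)
ascii_words = re.findall(r'\w+', text, re.ASCII)

# A pattern built from literals and '.' behaves the same either way.
page = '<!-- MAIN -->caf\u00e9<!-- /MAIN -->'
pattern = '<!-- MAIN -->(.*)<!-- /MAIN -->'
body_unicode = re.search(pattern, page, re.DOTALL).group(1)
body_ascii = re.search(pattern, page, re.DOTALL | re.ASCII).group(1)
```

So for the MAIN-marker pattern only re.DOTALL matters, and only when the body
spans several lines.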

> 2) Convert to unicode before UTF-8

Not sure what that means.

> 3) replace quotation marks within the SQL statement:
> data2.replace(u'"',u'\\"')

It's not a unicode problem, is it?
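
Quoting is usually better left to the database driver than done by hand with
replace(). The sketch below uses the standard-library sqlite3 module purely as
a stand-in so that it runs without a MySQL server; that substitution is an
assumption for illustration. sqlite3 uses ? as the placeholder, while MySQLdb
uses %s passed as a separate argument to execute() rather than applied with
the % operator.

```python
import sqlite3

# In-memory database standing in for the MySQL table from the question.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE junk (junklet TEXT)')

# Text with both kinds of quotes and some non-ASCII characters.
data2 = 'He said "hi", it\'s \u4e2d\u6587'

# The value travels separately from the SQL text, so no escaping is needed.
conn.execute('INSERT INTO junk (junklet) VALUES (?)', (data2,))

stored = conn.execute('SELECT junklet FROM junk').fetchone()[0]
conn.close()
```

The stored value comes back unchanged, quotes included, without any manual
escaping step.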

-- Serge.