Unicode from Web to MySQL

Sat Dec 20 19:18:20 EST 2003

Serge Orlov wrote:

>"Bill Eldridge" <bill at rfa.org> wrote in message news:mailman.375.1071937328.9307.python-list at python.org...
>  
>
>>I'm trying to grab a document off the Web and toss it
>>into a MySQL database, but I keep running into the
>>various encoding problems with Unicode (that aren't
>>a problem for me with GB2312, BIG 5, etc.)
>>
>>What I'd like is something as simple as:
>>
>>CREATE TABLE junk (junklet VARCHAR(2500) CHARACTER SET UTF8);
>>
>>import MySQLdb, re,urllib
>>
>>data = urllib.urlopen('http://localhost/test.html').read()
>>    
>>
>
>You've got 8-bit data here. urllib doesn't currently handle encoding issues;
>maybe submitting an SF feature request will change that. But right now you have
>to do it yourself. Scan (or parse) the HTML header for the encoding; if it's
>absent, grab the encoding from the HTTP header.
>
This part is fairly well known - I'm setting up feeds that I'll actually look
at to scrape content, so identifying the encoding will be part of that.
What's driving me crazy is knowing the encoding
but still not getting the data all the way through the chain to MySQL.

> If it's absent too, then the encoding
>AFAIR is latin1. So your code should look like:
>connection = urllib.urlopen('http://localhost/test.html')
>encoding = 'latin-1'
>header_encoding = get_http_header_encoding(connection)
>data = connection.read()
>content_encoding = get_http_content_encoding(data)
>if header_encoding:
>    encoding = header_encoding
>if content_encoding:
>    encoding = content_encoding
>
>  
>
The latin-1 stuff isn't giving me problems, it's the Asian languages,
but I'll look at the connection end.
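
The lookup order Serge describes (encoding declared in the HTML first, then the HTTP header, then latin-1) might be sketched roughly as follows. This is only an illustration written for a current Python 3. The names charset_from_header, charset_from_meta and pick_encoding are hypothetical stand-ins for the helpers in his pseudocode, and they take the Content-Type header value and the raw page bytes rather than a connection object.

```python
import re

def charset_from_header(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    if not content_type:
        return None
    match = re.search(r'charset\s*=\s*["\']?([\w.:-]+)', content_type, re.IGNORECASE)
    return match.group(1).lower() if match else None

def charset_from_meta(data):
    """Return the charset declared in an HTML <meta> tag, or None.

    data is the raw page as bytes; only the first few KB are scanned.
    """
    head = data[:4096]
    match = re.search(rb'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', head, re.IGNORECASE)
    return match.group(1).decode('ascii').lower() if match else None

def pick_encoding(content_type, data, default='latin-1'):
    """Prefer the HTML declaration, then the HTTP header, then the default."""
    return charset_from_meta(data) or charset_from_header(content_type) or default
```

With an Asian-language page the result would then be passed straight to data.decode(), for example pick_encoding('text/html; charset=GB2312', data).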

>>data2 = ???
>>    
>>
>
>data2 = data.decode(encoding,'replace')
>
>  
>
>>...
>>c.execute(''' INSERT INTO junk ( junklet) VALUES ( '%s') ''' % data2 )
>>    
>>
>
>A quick scan of the mysql-python docs reveals that you should also
>call connect with the unicode='utf-8' parameter. Have you done that?
>
>  
>
No, I haven't, I'll try it.

>>where data2 is somehow the UTF-8 converted version of the original Web page.
>>
>>Additionally, I'd like to be able to do:
>>
>>body_expr  = re.compile('''<!-- MAIN -->(.*)<!-- /MAIN -->''')
>>
>>data = urllib.urlopen('http://localhost/test.html').read()
>>
>>main_body = body_expr.search(data).group(1)
>>    
>>
>
>Don't do that to the data variable, because data is an 8-bit string which
>contains bytes, not characters; use data2 instead.
>
>  
>
Alright, I've tried it both ways, but this makes it clearer why.
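
A minimal sketch of that order of operations (decode first, then search the decoded text), again assuming a current Python 3. The function name extract_main and the sample page are made up for illustration, and re.DOTALL is added on the assumption that the marked section spans several lines.

```python
import re

# DOTALL lets (.*) span line breaks inside the marked section.
body_expr = re.compile(r'<!-- MAIN -->(.*)<!-- /MAIN -->', re.DOTALL)

def extract_main(data, encoding):
    """Decode raw page bytes, then return the text between the MAIN markers."""
    text = data.decode(encoding, 'replace')   # bytes -> characters first
    match = body_expr.search(text)            # search characters, not bytes
    return match.group(1) if match else None

# Hypothetical GB2312 page standing in for the real download.
page = '<html><!-- MAIN -->\n中文内容\n<!-- /MAIN --></html>'.encode('gb2312')
main_body = extract_main(page, 'gb2312')
```

Returning None when the markers are missing avoids the AttributeError that .group(1) raises on a failed search.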

>As a rule of thumb you should decode to unicode as soon as you
>can and leave unicode world as late as you can. And use unicode
>aware APIs when they are available, this way you won't even
>need to encode unicode objects.
>
>
>  
>
>>and insert that into the database, and most likely I need to
>>
>>I'm sitting with a dozen explanations from the Web explaining
>>how to do this,
>>0) decode('utf-8','ignore') or 'strict', or 'replace'...
>>1) using re.compile('''(?u)<!-- MAIN -->(.*)<!-- /MAIN -->''',
>>      re.UNICODE+re.IGNORECASE+re.MULTILINE+re.DOTALL)
>>    
>>
>
>You don't need re.UNICODE if you don't use  \w, \W, \b, or \B
>
>  
>
Thanks, I don't.
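
For what it's worth, the effect Serge mentions can be seen in a small sketch. It assumes a current Python 3, where str patterns are Unicode-aware by default and re.ASCII switches that off, which is the reverse of the default this thread is about. The sample strings are invented.

```python
import re

text = '中文 abc'

# With a str pattern, \w matches Unicode word characters by default.
unicode_words = re.findall(r'\w+', text)

# re.ASCII restricts \w to [a-zA-Z0-9_].
ascii_words = re.findall(r'\w+', text, re.ASCII)

# A pattern built only from literals and '.' behaves the same either way.
literal_default = re.findall(r'<i>(.*?)</i>', '<i>中文</i>')
literal_ascii = re.findall(r'<i>(.*?)</i>', '<i>中文</i>', re.ASCII)
```

So the flag only matters for the character-class shortcuts, not for a marker-to-marker pattern like the MAIN one.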

>>2) Convert to unicode before UTF-8
>>    
>>
>
>Not sure what that means.
>
>  
>
data.decode(None,'strict')
or
unicode(data,'unicode','strict')
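
As far as I can tell, neither None nor 'unicode' names a codec, so decode() needs the page's actual encoding. A rough sketch of "decode, then encode as UTF-8" and of what the three error handlers do, assuming a current Python 3 (to_utf8 is a made-up helper name):

```python
def to_utf8(data, source_encoding, errors='strict'):
    """Decode bytes from source_encoding, then re-encode them as UTF-8."""
    text = data.decode(source_encoding, errors)  # bytes -> str
    return text.encode('utf-8')                  # str -> UTF-8 bytes

raw = '中文'.encode('big5')        # sample Big5 input
utf8_bytes = to_utf8(raw, 'big5')

# The handlers differ only when the bytes are invalid for the named encoding.
bad = b'abc\xff'
ignored = bad.decode('utf-8', 'ignore')     # drops the bad byte
replaced = bad.decode('utf-8', 'replace')   # substitutes U+FFFD
try:
    bad.decode('utf-8', 'strict')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True                    # 'strict' raises instead
```

If 'strict' raises on a page, that usually means the encoding passed in is not the one the page was actually written in.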

>>3) replace quotation marks within the SQL statement:
>>data2.replace(u'"',u'\\"')
>>    
>>
>
>It's not a unicode problem, is it?
>
>  
>

Occasionally instead of getting the encoding error I get a SQL syntax error,
and figured that somewhere it was misinterpreting something like the end
delimiter. No proof though, just a guess, so I tried the replaces.
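
If the guess about quoting is right, one way around it is to pass the value as a query parameter and let the driver do the quoting, instead of building the statement with '%s' % data2. The sketch below uses sqlite3 purely as a stand-in so that it runs without a MySQL server. That swap is my assumption, and the sample string is invented. With MySQLdb the placeholder is %s rather than ?.

```python
import sqlite3

# sqlite3 stands in for MySQLdb here so the sketch runs without a server.
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE junk (junklet TEXT)')

# Text containing both kinds of quote plus non-ASCII characters.
data2 = 'He said "it\'s 中文"; -- not a comment'

# The driver quotes the value; no manual replace() and no '%s' % data2.
# With MySQLdb the equivalent call would be:
#   c.execute("INSERT INTO junk (junklet) VALUES (%s)", (data2,))
c.execute('INSERT INTO junk (junklet) VALUES (?)', (data2,))
conn.commit()

c.execute('SELECT junklet FROM junk')
stored = c.fetchone()[0]
```

The value comes back unchanged, quotes included, which is the behaviour the manual replace() calls were trying to get.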

Thanks much,
Bill