Unicode from Web to MySQL

Sat Dec 20 20:45:39 EST 2003

>>What I'd like is something as simple as:
>>
>>CREATE TABLE junk (junklet VARCHAR(2500) CHARACTER SET UTF8);
>>
>>import MySQLdb, re, urllib
>>
>>data = urllib.urlopen('http://localhost/test.html').read()
>>    
>>
>
>data2 = data.decode(encoding,'replace')
>...
>c.execute(''' INSERT INTO junk ( junklet) VALUES ( '%s') ''' % data2 )
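For reference, here is a minimal sketch of what the three decode error handlers do with a byte that is not valid UTF-8. It is written for Python 3, where the web data is a bytes object; the sample text, the stray byte and the try_decode helper are made up for illustration and are not from the thread.

```python
# Bytes as they might arrive from the web: valid UTF-8 plus one stray byte.
raw = "Ti\u1ebfng Vi\u1ec7t".encode("utf-8") + b"\xff"


def try_decode(data, errors):
    """Decode UTF-8 with the given error handler; return None on failure."""
    try:
        return data.decode("utf-8", errors)
    except UnicodeDecodeError:
        return None


strict = try_decode(raw, "strict")     # decoding fails, so this is None
ignored = try_decode(raw, "ignore")    # the bad byte is silently dropped
replaced = try_decode(raw, "replace")  # the bad byte becomes U+FFFD
```

With 'replace' nothing is lost silently: the damaged spot stays visible in the stored text as a replacement character.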
>  
>
>
>A quick scan of the mysql-python docs suggests that you should also
>call connect with the unicode='utf-8' parameter. Have you done that?
>
>  
>
I have added that now, but it doesn't seem to make much difference.
(I think it's more for returning data from MySQL than for storing it,
but it will still be useful.)

I did a test where I grabbed the URL using the same routines and
dumped the thing to a file, and then edited out all the English and
various HTML, and the SQL insert works at that point.

It seems the mixed language is throwing stuff off, which wouldn't
bother me if my re.search for only the Vietnamese text were working
properly, but it isn't.

>>where data2 is somehow the UTF-8 converted version of the original Web page.
>>
>>Additionally, I'd like to be able to do:
>>
>>body_expr  = re.compile('''<!-- MAIN -->(.*)<!-- /MAIN -->''')
>>
>>data = urllib.urlopen('http://localhost/test.html').read()
>>
>>main_body = body_expr.search(data).group(1)
>>    
>>
>
>Don't do that to the data variable: data is an 8-bit string that
>contains bytes, not characters. Use data2 instead.
>
>As a rule of thumb, you should decode to unicode as soon as you
>can and leave the unicode world as late as you can. Also use
>unicode-aware APIs when they are available; that way you won't
>even need to encode unicode objects.
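The rule of thumb above could be sketched roughly as follows, in Python 3. The extract_main helper, the sample page and the choice of UTF-8 are assumptions for illustration. re.DOTALL is also an addition here, on the guess that the body between the markers spans several lines.

```python
import re

# Compiled once; DOTALL lets '.' cross line breaks inside the marked body.
BODY_EXPR = re.compile(r"<!-- MAIN -->(.*)<!-- /MAIN -->", re.DOTALL)


def extract_main(raw_bytes, encoding="utf-8"):
    """Decode first, then search the decoded text, never the raw bytes."""
    text = raw_bytes.decode(encoding, "replace")
    match = BODY_EXPR.search(text)
    return match.group(1) if match else None


# Stand-in for the bytes returned by the HTTP fetch.
page = "<html><!-- MAIN -->\nXin ch\u00e0o\n<!-- /MAIN --></html>".encode("utf-8")
main_body = extract_main(page)
```

The result is already a unicode string, so it can be handed to a unicode-aware database API without another encoding step.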
>
>
>  
>
>>and insert that into the database, and most likely I need to
>>
>>I'm sitting with a dozen explanations from the Web explaining
>>how to do this,
>>0) decode('utf-8','ignore') or 'strict', or 'replace'...
>>1) using re.compile('''(?u)<!-- MAIN -->(.*)<!-- /MAIN -->''',
>>      re.UNICODE|re.IGNORECASE|re.MULTILINE|re.DOTALL)
>>    
>>
>
>You don't need re.UNICODE if you don't use \w, \W, \b, or \B.
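A small sketch of that point, in Python 3 and with made-up sample text: the flag that matters for a body spanning several lines is re.DOTALL, while the Unicode setting only changes classes such as \w. Python 3 str patterns are Unicode-aware by default, so re.ASCII is used below to show the non-Unicode behaviour.

```python
import re

text = "<!-- MAIN -->\nTi\u1ebfng Vi\u1ec7t\n<!-- /MAIN -->"
pattern = r"<!-- MAIN -->(.*)<!-- /MAIN -->"

# '.' does not match newlines by default, so a multi-line body is missed.
no_dotall = re.search(pattern, text)
# re.DOTALL, not re.UNICODE, is the flag that changes that.
with_dotall = re.search(pattern, text, re.DOTALL)

# The Unicode setting only affects what \w (and \W, \b, \B) match.
unicode_words = re.findall(r"\w+", "Ti\u1ebfng Vi\u1ec7t")
ascii_words = re.findall(r"\w+", "Ti\u1ebfng Vi\u1ec7t", re.ASCII)
```

Without Unicode-aware \w the Vietnamese words are split at every accented letter, which may be the kind of thing that makes a Vietnamese-only search misbehave.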
>
>  
>
>>2) Convert to unicode before UTF-8
>>    
>>
>
>Not sure what that means.
>
>  
>
>>3) replace quotation marks within the SQL statement:
>>data2.replace(u'"',u'\\"')
>>    
>>
>
>It's not a unicode problem, is it?
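Quoting is usually better left to the driver than done by hand with replace(). Below is a minimal sketch in Python 3 that uses the standard sqlite3 module as a stand-in, so it runs without a MySQL server; the sample string is made up. With MySQLdb the placeholder is %s rather than ?, and the value is likewise passed as a separate argument to execute instead of being formatted into the SQL string.

```python
import sqlite3

# sqlite3 stands in for MySQL here so the sketch runs without a server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE junk (junklet TEXT)")

# Text containing both kinds of quotation mark plus a non-ASCII character.
tricky = 'He said "xin ch\u00e0o" and it\'s fine'

# The value travels separately from the SQL text, so no manual escaping.
conn.execute("INSERT INTO junk (junklet) VALUES (?)", (tricky,))

stored = conn.execute("SELECT junklet FROM junk").fetchone()[0]
```

The same pattern also avoids the quoting breakage that can happen when page content is interpolated into the statement with the % operator.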
>
>-- Serge.
>
>
>  
>
