Unicode from Web to MySQL

Bill Eldridge bill at rfa.org
Sat Dec 20 20:09:16 EST 2003


Skip Montanaro wrote:

>    Bill> Note that I am able to create Unicode data and insert it with a
>    Bill> carefully controlled unicode string
>
>    Bill> data = u"Make \u0633\u0644\u0627\u0645, not war"
>    Bill> c.execute('''INSERT INTO junk (junklet) VALUES ('%s')''' %
>    Bill>           data.encode('utf-8','ignore'))
>
>    Bill> but this won't work with what I find on the Web.
>
>I suspect you don't know the encoding of the data you find on the
>web.  Once you know that, you can convert it to unicode, then encode that as
>utf-8, placing the result into the database.  You should know the encoding
>of the data from the Content-Type header.  If that's missing or incorrect,
>you should be able to make a reasonable guess based upon the non-ASCII
>patterns you find in the data.
>
>  
>
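
A minimal sketch of the approach Skip describes: read the charset from the Content-Type header, decode the raw page bytes with it, then re-encode as UTF-8. It is written for Python 3 (this thread predates it and used Python 2's unicode()). The helper names charset_from_content_type and to_utf8 are hypothetical, and falling back to UTF-8 when no charset is declared is an assumption, not something stated in the thread.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset parameter of a Content-Type header value.

    Falls back to `default` (an assumption) when the header is empty
    or declares no charset.
    """
    if not header_value:
        return default
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


def to_utf8(raw_bytes, content_type):
    """Decode raw page bytes with the declared charset, re-encode as UTF-8."""
    charset = charset_from_content_type(content_type)
    # "replace" substitutes U+FFFD for undecodable bytes instead of raising.
    text = raw_bytes.decode(charset, "replace")
    return text.encode("utf-8")
```

If the header is wrong, the decode step may silently produce replacement characters, so it can be worth checking the result for U+FFFD before storing it.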

As an example of the encoding, here is a Vietnamese page in UTF-8;
try:

http://www.rfa.org/service/index.html?service=vie
or
http://www.rfa.org/service/article.html?service=vie&encoding=9&id=123655

I've tried grabbing this. Calling vietstring.decode(None,'strict')
gives an error (it wants a string, not None), and
unicode(data,'unicode','replace') fails.
unicode(data,'raw-unicode-escape','replace') somewhat works, and
I can then try
unicode(data,'raw-unicode-escape','replace').encode('utf-8'),
but I get a SQL error at that point.
The SQL statement is:

' Insert INTO test_utf8 (title) VALUES ( '%s') ' % data2

With some 1000 characters of plain ASCII text this statement has
no problem, but Vietnamese text gives me the SQL error.
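
The thread does not say what the SQL error was. One possible cause, offered only as a guess, is that building the statement with % places quote characters or raw bytes from the page inside the SQL text. A minimal sketch of the alternative, passing the value as a query parameter so the driver handles quoting, follows. It uses the standard-library sqlite3 module as a stand-in so it runs without a MySQL server; with MySQLdb the placeholder would be %s and the value would go in the second argument to execute(). The function name insert_title is hypothetical.

```python
import sqlite3


def insert_title(conn, title):
    """Insert a title using a bound parameter instead of string formatting."""
    cur = conn.cursor()
    # sqlite3 uses "?" placeholders; MySQLdb uses "%s" in the same position.
    cur.execute("INSERT INTO test_utf8 (title) VALUES (?)", (title,))
    conn.commit()


# In-memory database standing in for the MySQL table from the message.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_utf8 (title TEXT)")

# Vietnamese text containing single quotes, which would break a
# statement assembled with '%s' string interpolation.
title = "Ti\u1ebfng Vi\u1ec7t: 'tin t\u1ee9c'"
insert_title(conn, title)
stored = conn.execute("SELECT title FROM test_utf8").fetchone()[0]
```

On the MySQL side, the table and connection character sets would also need to be UTF-8 for the stored text to round-trip. That part is not shown here.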





