Unicode from Web to MySQL

Skip Montanaro skip at pobox.com
Sat Dec 20 23:08:11 EST 2003


    Bill> Encoding for example is a UTF-8 page Vietnamese,
    Bill> try:

    Bill> http://www.rfa.org/service/index.html?service=vie
    ...

    Bill> I've tried grabbing this, doing vietstring.decode(None,'strict')
    Bill> gives an error (wants a string, not None), doing

Yeah, I can see that passing None would be a problem.  I assume vietstring
is a string object, not a unicode object?  If so, try

    uviet = unicode(vietstring, "utf-8")

which would give you a unicode object.  You can then convert it to other
encodings if you want.
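
As a rough sketch of that decode-then-convert step, here is the same idea without a database. It is written for Python 3, where bytes.decode() takes the place of unicode(); the sample bytes are an assumption, made up to stand in for whatever was fetched from the page.

```python
# Assumed sample data: the UTF-8 bytes for "Tiếng Việt", standing in for
# the raw string grabbed from the web page.
vietstring = b"Ti\xe1\xba\xbfng Vi\xe1\xbb\x87t"

# Decode the raw bytes into a text (unicode) object.
uviet = vietstring.decode("utf-8")

# Convert the text to other encodings as needed.
as_utf16 = uviet.encode("utf-16-le")
round_trip = uviet.encode("utf-8")

# The byte count and character count differ because two characters
# take three bytes each in UTF-8.
print(len(vietstring), len(uviet))
```

If decode() raises UnicodeDecodeError here, the data probably isn't utf-8 after all, which is worth knowing before it goes anywhere near the database.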

Furthermore, if you know the string is already encoded as utf-8, you should
be able to just stuff it in your database as-is.  I store utf-8 strings in
MySQL all the time.

    Bill> unicode(data,'unicode','replace') fails,
    Bill> unicode(data,'raw-unicode-escape','replace') somewhat works,
    Bill> I can then try
    Bill> unicode(data,'raw-unicode-escape','replace').encode('utf-8')
    Bill> but I get a SQL error at that point.
    Bill> (SQL statement is:

    Bill> ' Insert INTO test_utf8 (title) VALUES ( '%s') ' % data2

    Bill> which if I put straight ascii text of some 1000 characters or so has
    Bill> no problem, Vietnamese gives me the SQL error.

Can you paste an actual Python session into an email?

Here's an example. Table looks like this:

    mysql> describe testac;
    +--------------+-------------+------+-----+---------+-------+
    | Field        | Type        | Null | Key | Default | Extra |
    +--------------+-------------+------+-----+---------+-------+
    | object_id    | int(11)     | YES  |     | NULL    |       |
    | object_title | varchar(64) | YES  |     | NULL    |       |
    +--------------+-------------+------+-----+---------+-------+
    2 rows in set (0.03 sec)

Code looks like this:

    >>> import MySQLdb
    >>> conn = MySQLdb.Connect(host="localhost", db="test", user="someuser",
    ...     passwd="somepasswd")
    >>> c = conn.cursor()
    >>> s = u'\u1234'
    >>> s
    u'\u1234'
    >>> s.encode('utf8')
    '\xe1\x88\xb4'
    >>> c.execute('insert into testac values (47, %s)', (s,))
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "/usr/local/lib/python2.2/site-packages/MySQLdb/cursors.py", line 95, in execute
        return self._execute(query, args)
      File "/usr/local/lib/python2.2/site-packages/MySQLdb/cursors.py", line 114, in _execute
        self.errorhandler(self, exc, value)
      File "/usr/local/lib/python2.2/site-packages/MySQLdb/connections.py", line 33, in defaulterrorhandler
        raise errorclass, errorvalue
    UnicodeError: Latin-1 encoding error: ordinal not in range(256)

Note that trying to directly insert a Unicode object fails because MySQLdb
tries to encode it using its default encoding (in this case Latin-1), and
the Latin-1 charset doesn't have an encoding for that codepoint.
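
The same failure can be reproduced without MySQLdb at all. This is only a sketch of the encoding step on its own (Python 3 syntax; the variable names latin1_ok and utf8_bytes are mine, not anything from the module):

```python
s = u"\u1234"

# Latin-1 only covers code points 0-255, so U+1234 cannot be encoded.
try:
    s.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# UTF-8 can represent any code point; this one takes three bytes.
utf8_bytes = s.encode("utf-8")

print(latin1_ok, utf8_bytes)
```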

    >>> c.execute('insert into testac values (47, %s)', (s.encode('utf-8'),))
    1L

But encoding it as utf-8 succeeds.

    >>> c.execute('select * from testac where object_id=47')
    1L
    >>> c.fetchall()
    ((47L, '\xe1\x88\xb4'),)

And fetching that row shows that the object_title field does indeed have the
three-byte utf-8 encoded string.
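
To get Unicode objects back out, the fetched byte strings need to be decoded again. A minimal sketch, assuming every string column holds utf-8 (decode_row is a hypothetical helper of my own, not part of MySQLdb, and the rows tuple just mimics the fetchall() result above; Python 3 syntax):

```python
def decode_row(row, encoding="utf-8"):
    """Return a copy of row with byte-string fields decoded to text."""
    return tuple(
        field.decode(encoding) if isinstance(field, bytes) else field
        for field in row
    )

# Stand-in for the result of c.fetchall() in the session above.
rows = ((47, b"\xe1\x88\xb4"),)

decoded = [decode_row(r) for r in rows]
print(decoded)
```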

Skip




