Dealing with "funny" characters

John Nagle nagle at animats.com
Sun Oct 21 12:56:50 EDT 2007


Diez B. Roggisch wrote:
>>    No, no, that's wrong.  MySQL and the Python interface to it understand
>> Unicode.  You don't want to convert data to UTF-8 before putting it in a
>> database; the database indexing won't work.
> 
> I doubt that indexing has anything to do with it whatsoever.

     Of course it does.  ORDER BY, LIKE, TRIM, and other SQL expressions that
do more than an equal comparison need to know the actual data representation.
If you were to convert to UTF-8 or UCS-2 in the Python program and send
the resulting byte string to MySQL, with MySQL thinking it was storing
ASCII or a BLOB, many SQL functions won't work right.  A database is
not a file system; a database looks at the data.
> 
>>    Here's how to do it right.
>>
>>    First, tell MySQL, before you create your MySQL tables, that the 
>> tables are
>> to be stored in Unicode:
>>
>>     ALTER database yourdatabasename DEFAULT CHARACTER SET utf8;
>>
>> You can also do this on a table by table basis, or even for single 
>> fields,
>> but you'll probably get confused if you do.
>>
>>    Then, when you connect to the database in Python, use something 
>> like this:
>>
>>     db = MySQLdb.connect(host="localhost",
>>         use_unicode = True, charset = "utf8",
>>         user=username, passwd=password, db=database)
>>
>> That tells MySQLdb to talk to the database in Unicode, and it tells 
>> the database
>> (via "charset") that you're talking Unicode.
> 
> You confuse unicode with utf-8 here.
... pontification deleted

> But due to e.g. environmental settings, opened connections will deliver 
> the contents in e.g. latin1. Which of course will lead to problems if 
> you try to return data from the table with the topmost chines first names.
   ???
> 
> So you can alter the encoding the connection delivers and expects 
> byte-strings in. In mysql, this can be done explcit using
> 
> cursor.execute("set names <encoding>")
> 
> Or - as you said - as part of a connection-string.
> 
> db = MySQLdb.connect(host="localhost",
>         use_unicode = True, charset = "utf8",
>         user=username, passwd=password, db=database)
> 
> 
> But there is more to it. If the DB-API supports it, then the API itself 
> will decode the returned strings, using the specified encoding, so that 
> the user will only deal with "real" unicode-objects, greatly reducing 
> the risk of mixing byte-strings with unicode-objects. That's what the 
> use_unicod-parameter is for: it makes the API accept and deliver 
> unicod-objects. But it would do so even if the charset-parameter was 
> "latin1".

     Yes, and that's taken care of if the connection is set up as above.
For legacy reasons, there are both "unicode" and "charset" parameters
to "connect", and they can be set in incompatible ways.
Also, you need MySQL 5 or later to get full Unicode support.

					John Nagle



More information about the Python-list mailing list