Convert a list with wrong encoding to utf8

Gregory Ewing greg.ewing at canterbury.ac.nz
Fri Feb 15 03:08:18 EST 2019


vergos.nikolas at gmail.com wrote:
> [python] con = pymysql.connect( db = 'clientele', user = 'vergos', passwd =
> '******', charset = 'utf8' ) cur = con.cursor() [/python]
> 
> From that i understand that the names being fetched from the db to pyhton
> script are being fetced as utf8, right?

No, I don't think so.

As far as I can tell from a brief reading of the MySQL docs, that only
sets the *connection* encoding, which is concerned with transferring
data over the connection between the client and the server. It has no
bearing on the encoding used to decode data fetched from the database.

That's determined by metadata stored in the database itself. It seems
that MySQL lets you specify database encodings at three different levels:
for the database as a whole, for a specific table, and for a specific
field of a table.

What I think is happening is that the column you're reading the names
from is tagged in the database as being encoded in latin1, *but* this
is incorrect for some of the names, which are actually encoded in
utf8.

This would explain why some of the data you looked at was printed
with hex escapes, and why name.encode('latin1').decode('utf8')
appeared to fix it. The encode('latin1') gets back the original raw
bytes, and the decode('utf8') decodes them again using the correct
encoding.

However, not *all* of the data is like this -- some of it is correctly
stored in the database, and is coming back already correctly decoded
with no further processing needed. So, when you blindly try to re-code
all the names in the list, it fails on the first correctly-decoded one
it encounters.

Again, try printing out the whole list of names, and post it here (or
a good chunk of it if it's very long). It will give us a better idea
of what's going on.

If this theory is correct, then there isn't really any "right" way
to deal with it -- the fundamental problem is that the data in the
database is corrupted.

The best long-term solution would be to clean up the database. But if
you have to deal with it as it is, you'll need to use some kind of
heuristic to decide when a particular string needs "fixing", and
what needs to be done to fix it.

-- 
Greg



More information about the Python-list mailing list