Annoying unicode / str conversion problem

Mon Jan 26 16:16:26 EST 2009

Hans Müller wrote:

> Hi python experts,
> 
> at the moment I'm struggling with an annoying problem in conjunction with
> mysql.
> 
> I'm fetching rows from a database, which the mysql driver returns as a list
> of tuples.
> 
> The default encoding of the database is utf-8.
> 
> Unfortunately there are rows in the database with different encodings, and
> there is a blob column.
> 
> In the app I search for double entries in the database with this code:
> 
> hash = {}
> cursor.execute("select * from table")
> rows = cursor.fetchall()
> for row in rows:
>     key = "|".join([str(x) for x in row])         # <- here the problem arises
>     if key in hash:
>         print "found double entry"
> 
> This code works as expected with python 2.5.2
> With 2.5.1 it shows this error:
> 
> 
> key = "|".join(str(x) for x in row)
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u017e' in
> position 3: ordinal not in range(128)
> 
> When I replace the str() call by unicode(), I get this error when a blob
> column is being processed:
> 
> key = "|".join(unicode(x) for x in row)
> 
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 119:
> ordinal not in range(128)
> 
> 
> Please help, how can I convert ANY column data to a string which is usable
> as a key to a dictionary? The purpose of using a dictionary is to find
> equal rows in some database tables. Perhaps using an md5 hash of the
> column data is also an idea?
> 
> Thanks a lot in advance,

No direct answer, but can't you put the rows into the dict (or a set)
without converting them to a string?

seen = set()
for row in rows:
    if row in seen:
        print "dupe"
    else:
        seen.add(row)
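
Wrapped up as a function, the same idea might look like the sketch below.
The name find_duplicates and the sample rows are made up for illustration,
and it assumes every column value the driver returns is hashable. Since
nothing is passed through str() or unicode(), no implicit ascii
encoding or decoding takes place.

```python
def find_duplicates(rows):
    """Return each row that occurs more than once, in order of first repeat."""
    seen = set()
    dupes = []
    for row in rows:
        row = tuple(row)  # a tuple of hashable values is itself hashable
        if row in seen:
            if row not in dupes:
                dupes.append(row)
        else:
            seen.add(row)
    return dupes


# Hypothetical sample data: a number, non-ASCII text and a blob-like value.
rows = [
    (1, u"caf\u00e9", b"\xfc\x00"),
    (2, u"\u017eaba", b"\x01"),
    (1, u"caf\u00e9", b"\xfc\x00"),
]
dupes = find_duplicates(rows)
```

Here dupes ends up holding the one row that appears twice.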

Or, even better, solve the problem within the db:

select <fields> from <table> group by <fields> having count(*) > 1
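
To try that query without a MySQL server, here is a rough sketch using the
sqlite3 module from the standard library and an in-memory database. The
table name entries, its columns and the sample rows are invented for the
example, and it is written for Python 3, where bytes values are stored as
blobs. With MySQL the statement would be the same apart from the table
and field names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table entries (name text, payload blob)")
conn.executemany(
    "insert into entries (name, payload) values (?, ?)",
    [
        (u"\u017eaba", b"\xfc\x01"),   # inserted twice -> a duplicate
        (u"\u017eaba", b"\xfc\x01"),
        (u"other", b"\x00"),           # unique row
    ],
)

# Group on every column; any group with more than one member is a duplicate.
dupes = conn.execute(
    "select name, payload, count(*) from entries "
    "group by name, payload having count(*) > 1"
).fetchall()
conn.close()
```

Each tuple in dupes carries the duplicated values plus how often they occur,
so the comparison never has to happen in Python at all.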

Peter