Is there a way to get utf-8 out of a Unicode string?

Mon Oct 30 04:12:25 EST 2006

Fredrik Lundh wrote:
> thebjorn wrote:
>
> > I've got a database (ms sqlserver) that's (way) out of my control,
> > where someone has stored utf-8 encoded Unicode data in regular varchar
> > fields, so that e.g. the string 'Blåbærsyltetøy' is in the database
> > as 'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y' :-/
..
> first, check if you can get your database adapter to understand that the
> database contains UTF-8 and not ISO-8859-1.

That would be the way to go; however, it looks like they've managed to
get Latin-1 data into exactly two columns of the database (this is a
commercial product, of course, so there's no way for us to fix things).
And just to make things more interesting, I think I'm running into an
ADO bug where capital letters (outside the U+0000 to U+007F range) come
back as strange values:

>>> c.execute('create table utf8 (f1 varchar(15))')
>>> u'ÆØÅÉ'.encode('utf-8')
'\xc3\x86\xc3\x98\xc3\x85\xc3\x89'
>>> x = _
>>> c.execute('insert into utf8 (f1) values (?)', (x,))
>>> c.execute('select * from utf8')
>>> c.fetchall()
((u'\xc3\u2020\xc3\u02dc\xc3\u2026\xc3\u2030',),)
>>>

I haven't tested this through C[#/++] to verify that it's an ADO issue,
but it seems unlikely that MS would view this as anything but incorrect
usage no matter where the issue is...
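
One possible reading of the values above, offered as a guess rather than
a diagnosis: they are what you get when the bytes 0x86, 0x98, 0x85 and
0x89 are decoded as Windows-1252 (cp1252) instead of ISO-8859-1, since
cp1252 maps those bytes to U+2020, U+02DC, U+2026 and U+2030. If that is
what the adapter does, the data can be recovered by encoding back to
cp1252 before decoding as UTF-8. A minimal sketch in Python 3 syntax
follows; the function name is made up for illustration, and the fallback
for the five bytes cp1252 leaves undefined assumes the adapter passes
them through as U+0081 and so on, which I have not verified.

```python
def undo_cp1252_mojibake(text):
    """Recover text whose UTF-8 bytes were mis-decoded as Windows-1252.

    Hypothetical helper: assumes the adapter decoded each byte as cp1252.
    """
    raw = bytearray()
    for ch in text:
        try:
            raw += ch.encode("cp1252")
        except UnicodeEncodeError:
            # 0x81, 0x8d, 0x8f, 0x90 and 0x9d have no cp1252 mapping in
            # Python's codec; assume they arrived as U+0081 etc.
            raw += ch.encode("latin-1")
    return bytes(raw).decode("utf-8")


fetched = '\xc3\u2020\xc3\u02dc\xc3\u2026\xc3\u2030'  # value from fetchall() above
recovered = undo_cp1252_mojibake(fetched)
print(recovered)
```

Under that assumption the fetched value decodes back to the string that
was inserted, and lower-case data such as the 'Blåbærsyltetøy' example
is unaffected because cp1252 and ISO-8859-1 agree from 0xa0 upwards.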

Anyway, sorry for venting :-)

> if that's not possible, you can roundtrip via ISO-8859-1 yourself:
>
>  >>> u = u'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'
...
>  >>> print u.encode("iso-8859-1").decode("utf-8")
> Blåbærsyltetøy

That's very nice!
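
Since a couple of columns really do hold Latin-1, the roundtrip cannot be
applied blindly to every field. Here is a rough sketch of applying it per
column to a fetched row, in Python 3 syntax (in Python 2, `str` would be
`unicode`). The helper names, the sample row and the `skip` parameter are
all invented for illustration; nothing here comes from the real schema.

```python
def latin1_to_utf8(value):
    """Re-read a mis-decoded string: code points 0-255 -> bytes -> UTF-8."""
    if not isinstance(value, str):
        return value  # leave numbers, None, dates and so on untouched
    return value.encode("iso-8859-1").decode("utf-8")


def fix_row(row, skip=()):
    """Repair every column of a row except the column indexes in `skip`."""
    return tuple(v if i in skip else latin1_to_utf8(v)
                 for i, v in enumerate(row))


# Hypothetical row: an id, a UTF-8-in-varchar column, a genuine Latin-1 column.
row = (1, 'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y', 'bl\xe5b\xe6r')
fixed = fix_row(row, skip={2})
print(fixed)
```

The roundtrip works because ISO-8859-1 maps code points 0-255 one-to-one
onto byte values, so encoding with it gives back the original bytes. A
genuine Latin-1 string pushed through it would usually raise
UnicodeDecodeError, which is why those columns are skipped.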

-- bjorn