Text Encoding - Like Wrestling Oiled Pigs

Fri Dec 8 12:16:51 EST 2006

[apotheos at gmail.com]

| I've got a database of information that is encoded in Windows/CP1252.
| What I want to do is dump this to a UTF-8 encoded text file (a RSS
| feed).

| "UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position
| 163: ordinal not in range(128)"
| 
| So somewhere I'm missing an implicit conversion to ASCII which is
| completely aggravating my brain.

[... snip ...]

|           descript = each[2] + '<BR><BR>' + each[1]

|           output.write(u'<description>' + unicode(descript) +
| u'</description>\n')     # this is the line that causes the error.

Well, if the columns in the "each" row you're using are
already unicode objects, then that unicode(descript) is
redundant: descript will already be a unicode object. If
the columns are returned as encoded byte strings, then
you're not giving unicode() any clue as to what that
encoding is, so it falls back to the default codec, ascii,
which can't decode byte 0x92.

You said above that the data is stored as cp1252, so why
not try unicode (descript, "cp1252")?

<code>
unicode ("\x92")
unicode ("\x92", "cp1252")

</code>

gives

<dump>
>>> unicode ("\x92")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 0: ordinal not in range(128)
>>> unicode ("\x92", "cp1252")
u'\u2019'

</dump>
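
Putting that together for your loop, here is a rough sketch rather than
a drop-in fix. I'm assuming the columns come back as cp1252 byte
strings; the "each" row below, the to_text helper and the feed.xml path
are all made up for illustration. It uses codecs.open so the file
object does the UTF-8 encoding on the way out, and the b"..." literals
need Python 2.6 or later.

```python
import codecs
import os
import tempfile

# Hypothetical row standing in for one database record. Byte 0x92 is the
# cp1252 right single quotation mark that triggered the original error.
each = (1, b"It\x92s the body", b"It\x92s the title")


def to_text(raw, encoding="cp1252"):
    """Decode a byte string explicitly; pass unicode text through as-is."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw


# Decode each column first, so the concatenation never mixes byte
# strings with unicode and never falls back to the ascii codec.
descript = to_text(each[2]) + u"<BR><BR>" + to_text(each[1])
line = u"<description>" + descript + u"</description>\n"

# Assumed output location; codecs.open encodes to UTF-8 on write.
path = os.path.join(tempfile.mkdtemp(), "feed.xml")
output = codecs.open(path, "w", encoding="utf-8")
try:
    output.write(line)
finally:
    output.close()
```

Here raw.decode("cp1252") does the same job as unicode(raw, "cp1252")
above; the point is just that the encoding is named exactly once, at
the point where the bytes come out of the database.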

Or am *I* missing something more obvious?

TJG
