Text Encoding - Like Wrestling Oiled Pigs

John Machin sjmachin at lexicon.net
Fri Dec 8 12:07:29 EST 2006


apotheos at gmail.com wrote:
> So I've got a problem.
>
> I've got a database of information that is encoded in Windows/CP1252.
> What I want to do is dump this to a UTF-8 encoded text file (an RSS
> feed).
>
> While the overall problem seems to be related to the conversion, the
> only error I'm getting is a
>
> "UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position
> 163: ordinal not in range(128)"
>
> So somewhere I'm missing an implicit conversion to ASCII which is
> completely aggravating my brain.
>
> So, what fundamental issue am I completely overlooking?

That nowhere in your *code* do you mention "I've got a database of
information that is encoded in Windows/CP1252". This is not recorded
anywhere in your database. Python is fantastic, but we don't expect a
readauthorsmind() function until Python 4000 :-)

>
> Code follows.
>
[snip]
>
>     sql_query = "select story.subject as subject, story.content as
> content, story.summary as summary, story.sid as sid, posts.bid as
> board, posts.date_to_publish as date from story$

The above line has been mangled ... fortunately it doesn't affect the
diagnostic outcome.

[snip]
>
>
>           output.write(u'<description>' + unicode(descript) +
> u'</description>\n')     # this is the line that causes the error.

What is happening is that unicode(descript) has not been told what
encoding to use to decode your "Windows/CP1252" text, so it falls back
to the default encoding, "ascii", which cannot decode byte 0x92. You
need to name the encoding explicitly: unicode(descript, 'cp1252').
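
A minimal sketch of the same decode-then-encode round trip, using a
made-up sample string rather than your real data (the names raw, text
and utf8_bytes are only for illustration). It uses the .decode() method,
which does the same job as unicode(descript, 'cp1252') and also runs on
Python 3, where the unicode() builtin no longer exists:

```python
# Hypothetical sample: bytes as they might come out of a CP1252 column.
# 0x92 is the "right single quotation mark" in CP1252.
raw = b"It\x92s a description"

# Decoding as ASCII fails, because 0x92 is outside range(128).
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Naming the real source encoding yields proper text.
text = raw.decode("cp1252")

# Build the element as text, then encode once, explicitly, as UTF-8.
line = u"<description>" + text + u"</description>\n"
utf8_bytes = line.encode("utf-8")
```

The general idea is to decode at the point where the bytes enter your
program, work with text in the middle, and encode to UTF-8 only when
writing the feed. How the encoding step fits your script depends on how
your output file was opened, which isn't visible in the snipped code.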

Cheers,
John



