Encoding Text

Sun May 4 04:57:35 EDT 2008

Paul Jefferson wrote:
> I'm learning this and I'm making a program which takes RSS feeds and 
> processes them and then outputs them to an HTML file.
> The problem I have is that some of the RSS feeds contain characters which 
> I think are outside of the ascii range as when I attempt to write the 
> file containing them I get the following error:

> File "C:\Users\Paul\Desktop\python\Feed reader\test.py", line 201, in 
> <module>
>     runner(1,"save.ahc")
>   File "C:\Users\Paul\Desktop\python\Feed reader\test.py", line 185, in 
> runner
>     h.write(html[countera])
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in 
> position 1147: ordinal not in range(128)

> For the life of me I can't figure out what I should do to stop this - I 
> tried to change the html by doing a html[countera] = 
> unicode(html[countera]) but it didn't seem to change anything.

I assume from the traceback that html[countera] contains a unicode
object and that h is a regular file open for writing. If that's the
case then you need to do one of two things to have the string written
correctly: either open/wrap the file via the codecs module to specify
the encoding through which unicode objects should be written; or
explicitly encode the unicode objects as they're written.

If you don't do one of those two things then Python will try to
convert your unicode object implicitly so that it can be written to
file, and it will use the ascii codec, which can't handle anything
outside code points 0-127. The u'\u2019' in your traceback is a
right single quotation mark, probably pasted in from a Windows program.
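
To see the failure in isolation, here is a minimal sketch. The name
quote is made up for illustration and isn't in your code; it just
holds the character from your traceback. The sketch tries the ascii
codec first and then utf8:

```python
# Hypothetical stand-in for the character named in the traceback
quote = u"\u2019"  # RIGHT SINGLE QUOTATION MARK

try:
    quote.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    # ascii only covers code points 0-127, so this branch is taken
    ascii_ok = False

# utf8 can represent any unicode character; this one takes three bytes
utf8_bytes = quote.encode("utf8")
```

The same UnicodeEncodeError is what you get when the conversion
happens implicitly inside h.write.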

Since explicit encoding is easier to demonstrate from the code above:

<code snippet>
h.write (html[countera].encode ("utf8"))
</code snippet>

but a more general-purpose solution is probably:

<code snippet>
import codecs

h = codecs.open ("save.ahc", "w", encoding="utf8")
.
.
h.write (html[countera])
</code snippet>
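
If the file is already open by the time you get hold of it, the
"wrap" option can be sketched with codecs.getwriter. Again the
sample data and temporary path are assumptions, and I'm assuming the
underlying file was opened for bytes:

```python
import codecs
import os
import tempfile

# Hypothetical stand-ins for your html list and output file
html = [u"<p>It\u2019s a feed item</p>"]
path = os.path.join(tempfile.mkdtemp(), "save.ahc")

raw = open(path, "wb")             # an ordinary file, opened for bytes
h = codecs.getwriter("utf8")(raw)  # wrapper which encodes on write
try:
    h.write(html[0])               # unicode in, utf8 bytes out
finally:
    h.close()                      # delegates to raw.close()

# Check what actually landed on disk
with open(path, "rb") as f:
    written = f.read()
```

Either way the rest of the program can pass unicode objects to
h.write without encoding them itself.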

TJG