[Tutor] converting encoded symbols from rss feed?

Serdar Tumgoren zstumgoren at gmail.com
Fri Jun 19 00:33:18 CEST 2009


> The example is written assuming the console encoding is utf-8. Yours
> seems to be cp437. Try this:

> In [1]: import sys
>
> In [2]: sys.stdout.encoding
> Out[2]: 'cp437'

That is indeed the result that I get as well.


> But there is another problem - \u2013 is an en dash which does not
> appear in cp437, so even giving the correct encoding doesn't work. Try
> this:
> In [6]: x = u"abc\u2591"
>
> In [7]: print x.encode('cp437')
> ------> print(x.encode('cp437'))
> abc░
>
So does this mean that my python install is incapable of encoding the
en/em dash?
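
One way to check this directly is to attempt the encode and catch the
failure. The sketch below is only an illustration, not something from
the thread: can_encode is a hypothetical helper name, and the code is
written so that it should also run on Python 3. It suggests the limit
lies in the cp437 code page, which has no en dash, rather than in the
Python install itself.

```python
def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# U+2591 (light shade block) is part of cp437, so this succeeds.
block_ok = can_encode(u"abc\u2591", "cp437")

# U+2013 (en dash) is not part of cp437, so this fails.
dash_ok = can_encode(u"\u2013", "cp437")

# The "replace" error handler substitutes "?" for anything unencodable,
# which lets output continue instead of raising.
fallback = u"a \u2013 b".encode("cp437", "replace")

print(block_ok)
print(dash_ok)
```

Encoding to utf-8 instead of cp437 would keep the dash intact, at the
cost of the Windows console possibly not displaying it correctly.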

For the time being, I've gone with treating the symptom rather than
the root problem and created a translate function.

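def translate_code(text):
    # Replace typographic quotes and dashes with plain ASCII equivalents.
    text = text.replace("‘","'")   # left single quote
    text = text.replace("’","'")   # right single quote
    text = text.replace("“",'"')   # left double quote
    text = text.replace("”",'"')   # right double quote
    text = text.replace("–","-")   # en dash
    text = text.replace("—","--")  # em dash
    return text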
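
The same replacements can also be written as one lookup table applied
to decoded (unicode) text, which avoids depending on how the source
file of the script itself is encoded. This is a minimal sketch that
assumes the text has already been decoded to unicode; PUNCTUATION_MAP
and translate_text are illustrative names, not part of the code above.

```python
# Map Unicode code points to their plain ASCII stand-ins.
PUNCTUATION_MAP = {
    0x2018: u"'",   # left single quote
    0x2019: u"'",   # right single quote
    0x201C: u'"',   # left double quote
    0x201D: u'"',   # right double quote
    0x2013: u"-",   # en dash
    0x2014: u"--",  # em dash
}

def translate_text(text):
    """Apply every replacement in one pass over a unicode string."""
    return text.translate(PUNCTUATION_MAP)

sample = translate_text(u"\u201cHi\u201d \u2013 it\u2019s")
```

Writing the characters as code points rather than literal glyphs means
the table does not change meaning if the script is saved in a
different encoding.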

Which of course has led to a new problem. I'm first using Fredrik
Lundh's code to strip out stray HTML gobbledygook, then running my
translate function over the file to replace the windows-1252 encoded
characters.

But for some reason, I can't seem to get my translate_code function to
work inside the same loop as Mr. Lundh's html cleanup code. Below is
the problem code:

infile = open('test.txt','rb')
outfile = open('test_cleaned.txt','wb')

for line in infile:
    try:
        newline = strip_html(line)
        cleanline = translate_code(newline)
        outfile.write(cleanline)
    except:
        newline = "NOT CLEANED: %s" % line
        outfile.write(newline)

infile.close()
outfile.close()

The strip_html function, documented here
(http://effbot.org/zone/re-sub.htm#unescape-html), returns a text
string as far as I can tell. I don't understand why I can't
further manipulate that string with the "translate_code" function and
store the result in the "cleanline" variable. When I try this
approach, none of the translations succeed and I'm left with the same
HTML gook in the "outfile".

Is there some way to combine these functions so I can perform all the
processing in one pass?
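
A single pass might look roughly like the sketch below. It is an
illustration under stated assumptions, not a confirmed diagnosis: it
assumes the input bytes really are windows-1252 (cp1252), it decodes
each line to unicode before any replacement, and it reports decode
errors instead of hiding them behind a bare except. The names
clean_line, clean_stream and SMART_PUNCTUATION are hypothetical, and
the strip_markup argument is a stand-in for the strip_html function,
which is not reproduced here.

```python
import io

# Code points for the windows-1252 "smart" punctuation handled above.
SMART_PUNCTUATION = {
    0x2018: u"'", 0x2019: u"'",
    0x201C: u'"', 0x201D: u'"',
    0x2013: u"-", 0x2014: u"--",
}

def clean_line(raw, strip_markup=None, source_encoding="cp1252"):
    """Decode one raw line, optionally strip markup, then map punctuation."""
    text = raw.decode(source_encoding)
    if strip_markup is not None:
        text = strip_markup(text)  # e.g. a function such as strip_html
    return text.translate(SMART_PUNCTUATION)

def clean_stream(infile, outfile, strip_markup=None):
    """Clean every line of a binary input stream in a single pass."""
    for raw in infile:
        try:
            outfile.write(clean_line(raw, strip_markup).encode("utf-8"))
        except UnicodeError as err:
            # Record why the line failed rather than silently skipping it.
            note = u"NOT CLEANED (%s): " % err
            outfile.write(note.encode("utf-8") + raw)

# In-memory streams stand in for test.txt and test_cleaned.txt.
source = io.BytesIO(b"\x93quoted\x94 \x96 it\x92s\n")
target = io.BytesIO()
clean_stream(source, target)
```

With real files, the two BytesIO objects would be replaced by the
handles opened with 'rb' and 'wb' as in the loop above.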

