[XML-SIG] Re: HTML<->UTF-8 'codec'?

Thu, 7 Mar 2002 16:28:37 -0800

See:
http://mail.python.org/pipermail/xml-sig/2001-October/006214.html

ftp://ftp.parc.xerox.com/transient/janssen/htmlcodec.py

I've downloaded Bill Janssen's module to escape UTF-8 to HTML and
vice versa, but I'm a Python newbie and I really can't tell how to make it
work. I have some UTF-8 data with a bunch of curly quotes that I'd like
to turn into HTML entities. This module seems perfect for that, but
it doesn't do what I expect. If anybody knows how to fix it, or if
there's another option besides writing my own regular expression, I'd
appreciate it.

I should mention that I have

sys.setdefaultencoding('utf-8')

in my sitecustomize.py.

It seems like I should use the decode function: "Decode takes UTF-8 HTML
and converts all characters above the ASCII range to HTML character
entity references." But it appears that the opposite is true.

This works...

>>> print 'I&rsquo;ve had'.decode("html-utf-8")
I’ve had

>>> print 'I&rsquo;ve had'.decode("html-utf-8").encode("html-utf-8")
I&#8217;ve had

OK, but here's the problem. Cutting and pasting from my Word-generated
UTF-8 file into IDLE, I get:

>>> print 'I’ve had'.encode("html-utf-8")
I&#226;&#128;&#153;ve had

Which makes a bunch of garbage in my browser of course.
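
The numbers in that output match the three UTF-8 bytes of the curly quote (E2 80 99) taken one at a time, instead of its single code point U+2019. A minimal sketch of the two behaviours, assuming Python 3 and only built-ins rather than htmlcodec (the helper names refs_per_code_point and refs_per_byte are made up for illustration):

```python
# Assumes Python 3; uses only built-ins, not htmlcodec.
text = "I\u2019ve had"        # U+2019 RIGHT SINGLE QUOTATION MARK
raw = text.encode("utf-8")    # the same text as UTF-8 bytes

def refs_per_code_point(s):
    """Replace each non-ASCII character with one numeric reference."""
    return s.encode("ascii", "xmlcharrefreplace").decode("ascii")

def refs_per_byte(data):
    """Replace each non-ASCII byte with its own reference (the garbage case)."""
    return "".join(chr(b) if b < 128 else "&#%d;" % b for b in data)

print(refs_per_code_point(text))  # I&#8217;ve had
print(refs_per_byte(raw))         # I&#226;&#128;&#153;ve had
```

The second line of output is what a browser receives when each byte is escaped on its own, which is why it shows up as three unrelated characters.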

At first I was thinking there was something wrong with my form of utf-8.

But Notepad and IE6 recognize it as UTF-8 and open and display it fine,
and re-saving from Notepad in UTF-8 format gives the same result.

So I did research on this for a couple of hours and I made this test:

import htmlcodec
import unicodedata
import shutil

f=open('newfile.html','wb')
f.write(unicodedata.lookup('RIGHT DOUBLE QUOTATION MARK'))
f.close()

f=open('newfile.html','rb')
a = f.read()
b = a.encode('html-utf-8')
print 'from file'
print b
print 'no file'
print unicodedata.lookup('RIGHT DOUBLE QUOTATION MARK').encode('html-utf-8')
del f

results in:

from file
&#226;&#128;&#157;
no file
&#8221;
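
The "from file" result looks consistent with the file contents being handled as raw bytes rather than as decoded text, though that is only a guess about what htmlcodec does internally. A rough sketch of the same round trip, assuming Python 3 and the standard-library xmlcharrefreplace error handler standing in for htmlcodec (the temporary path is a stand-in for newfile.html):

```python
import os
import tempfile
import unicodedata

# Assumes Python 3; xmlcharrefreplace stands in for the html-utf-8 codec.
quote = unicodedata.lookup("RIGHT DOUBLE QUOTATION MARK")  # U+201D

path = os.path.join(tempfile.mkdtemp(), "newfile.html")
with open(path, "wb") as f:
    f.write(quote.encode("utf-8"))    # three bytes: e2 80 9d

with open(path, "rb") as f:
    raw = f.read()

# Decode the bytes to text first, then replace non-ASCII characters.
from_file = raw.decode("utf-8").encode("ascii", "xmlcharrefreplace")
no_file = quote.encode("ascii", "xmlcharrefreplace")

print(from_file)  # b'&#8221;'
print(no_file)    # b'&#8221;'
```

With the explicit decode step, both paths give the same reference. Whether the equivalent a.decode('utf-8') call is enough for htmlcodec under Python 2 is untested here.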

Any help is appreciated!

Davep