[XML-SIG] Re: HTML<->UTF-8 'codec'?
David Primmer
dave@primco.org
Thu, 7 Mar 2002 16:28:37 -0800
See:
http://mail.python.org/pipermail/xml-sig/2001-October/006214.html
ftp://ftp.parc.xerox.com/transient/janssen/htmlcodec.py
I've downloaded Bill Janssen's module to escape UTF-8 to HTML and
vice versa, but I'm a Python newbie and I really can't tell how to make it
work. I have some UTF-8 data with a bunch of curly quotes that I'd like
to turn into HTML entities, and this module seems perfect for it, but
it doesn't do what I expect. If anybody knows how to fix it, or if
there's another option besides writing my own regular expression, I'd appreciate it.
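One option that avoids a hand-written regular expression, sketched here in present-day Python 3 rather than the Python of this post: decode the UTF-8 bytes to text, then re-encode to ASCII with the standard `xmlcharrefreplace` error handler, which replaces every non-ASCII character with a numeric character reference. The function name `utf8_to_html_refs` is made up for illustration and is not part of htmlcodec.py.

```python
def utf8_to_html_refs(raw: bytes) -> str:
    """Turn UTF-8 bytes into ASCII-only text with numeric character references.

    Hypothetical helper for illustration; not part of htmlcodec.py.
    """
    text = raw.decode("utf-8")  # bytes -> Unicode text
    # Non-ASCII characters become &#NNNN; references; ASCII passes through.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# U+2019 RIGHT SINGLE QUOTATION MARK is the bytes E2 80 99 in UTF-8.
print(utf8_to_html_refs(b"I\xe2\x80\x99ve had"))  # I&#8217;ve had
```

This leaves `&`, `<`, and `>` alone, so it suits text that is already HTML markup.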
I should mention that I have
sys.setdefaultencoding('utf-8')
in my sitecustomize.py.
It seems like I should use the decode function: "Decode takes UTF-8 HTML
and converts all characters above the ASCII range to HTML character
entity references." But it appears that the opposite is true.
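To keep the two directions straight, here is a hedged Python 3 sketch of the reverse conversion using only the standard library: `html.unescape` turns character references back into characters, and encoding the result gives UTF-8 bytes. It says nothing about how htmlcodec.py names its directions, and `refs_to_utf8` is a hypothetical helper.

```python
import html


def refs_to_utf8(markup: str) -> bytes:
    """Replace character references with real characters, then encode as UTF-8.

    Hypothetical helper for illustration; not part of htmlcodec.py.
    """
    text = html.unescape(markup)  # handles numeric and named references
    return text.encode("utf-8")


print(refs_to_utf8("I&#8217;ve had"))  # b'I\xe2\x80\x99ve had'
print(refs_to_utf8("I&rsquo;ve had"))  # b'I\xe2\x80\x99ve had'
```

Note that `html.unescape` also converts `&amp;`, `&lt;`, and `&gt;`, so it is meant for text content rather than whole HTML documents.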
This works...
>>> print 'I’ve had'.decode("html-utf-8")
I’ve had
>>> print 'I’ve had'.decode("html-utf-8").encode("html-utf-8")
I’ve had
OK... but here's the problem. Using a cut-and-paste from my Word-generated
UTF-8 file into IDLE I get:
>>> print 'I’ve had'.encode("html-utf-8")
I’ve had
Which makes a bunch of garbage in my browser of course.
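One possible explanation for this kind of garbage, offered as a guess rather than a diagnosis of htmlcodec.py: if the pasted UTF-8 bytes are treated as one character per byte (as in Latin-1) before conversion, each of the three bytes of the curly quote gets its own character reference. The Python 3 sketch below contrasts that with decoding as UTF-8 first; the variable names are illustrative only.

```python
# The curly apostrophe U+2019 occupies three bytes in UTF-8: E2 80 99.
raw = "I\u2019ve had".encode("utf-8")

# Assumed failure mode: every byte is taken as its own character.
byte_per_char = raw.decode("latin-1").encode("ascii", "xmlcharrefreplace")

# Decoding as UTF-8 first yields a single reference for the single character.
decoded_first = raw.decode("utf-8").encode("ascii", "xmlcharrefreplace")

print(byte_per_char)  # b'I&#226;&#128;&#153;ve had'
print(decoded_first)  # b'I&#8217;ve had'
```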
At first I thought there was something wrong with my form of UTF-8.
But Notepad and IE6 recognize it as UTF-8 and open and display it fine,
and re-saving from Notepad in UTF-8 format gives the same result.
So I did research on this for a couple of hours and I made this test:
import htmlcodec
import unicodedata
import shutil
f = open('newfile.html', 'wb')
f.write(unicodedata.lookup('RIGHT DOUBLE QUOTATION MARK'))
f.close()
f = open('newfile.html', 'rb')
a = f.read()
b = a.encode('html-utf-8')
print 'from file'
print b
print 'no file'
print unicodedata.lookup('RIGHT DOUBLE QUOTATION MARK').encode('html-utf-8')
del f
results in:
from file
”
no file
”
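For comparison, a rough Python 3 rendering of the same experiment that uses only the standard library instead of the "html-utf-8" codec. It assumes the key difference between the two cases is that the file hands back bytes while `unicodedata.lookup` hands back text, so the bytes are decoded explicitly before conversion. The temporary directory is just a convenience of this sketch.

```python
import os
import tempfile
import unicodedata

quote = unicodedata.lookup("RIGHT DOUBLE QUOTATION MARK")  # U+201D

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "newfile.html")
    # Write the character to disk as explicit UTF-8 bytes.
    with open(path, "wb") as f:
        f.write(quote.encode("utf-8"))
    # Reading in binary mode returns bytes, not text.
    with open(path, "rb") as f:
        raw = f.read()

# Decode the bytes to text before asking for character references.
from_file = raw.decode("utf-8").encode("ascii", "xmlcharrefreplace")
no_file = quote.encode("ascii", "xmlcharrefreplace")

print("from file", from_file)  # from file b'&#8221;'
print("no file", no_file)      # no file b'&#8221;'
```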
Any help is appreciated!
Davep