Bug in htmlentitydefs.py with Python 3.0?
André
andre.roberge at gmail.com
Wed Dec 26 18:59:08 EST 2007
On Dec 26, 7:30 pm, "Martin v. Löwis" <mar... at v.loewis.de> wrote:
> André wrote:
> > In trying to parse html files using ElementTree running under Python
> > 3.0a1, and using htmlentitydefs.py to add "character entities" to the
> > parser, I found that I needed to create a customized version of
> > htmlentitydefs.py to make things work properly.
>
> Can you please state what precise problem you were seeing? The original
> code looks fine to me as it stands.
>
As stated above, I was using ElementTree to parse an html file and
sending the output to a browser.
Without an additional parser, I was getting the following error
message:
Traceback (most recent call last):
  File "/Users/andre/CrunchySVN/branches/andre/src/http_serve.py", line 79, in do_POST
    self.server.get_handler(realpath)(self)
  File "src/plugins/handle_default.py", line 53, in handler
    data = path_to_filedata(request.path, root_path)
  File "src/plugins/handle_default.py", line 39, in path_to_filedata
    return cp.create_vlam_page(open(npath), path).read()
  File "/Users/andre/CrunchySVN/branches/andre/src/CrunchyPlugin.py", line 98, in create_vlam_page
    return vlam.CrunchyPage(filehandle, url, remote=remote, local=local)
  File "/Users/andre/CrunchySVN/branches/andre/src/vlam.py", line 62, in __init__
    self.tree = parse(filehandle)#XmlFile(filehandle)
  File "/usr/local/py3k/lib/python3.0/xml/etree/ElementTree.py", line 823, in parse
    tree.parse(source, parser)
  File "/usr/local/py3k/lib/python3.0/xml/etree/ElementTree.py", line 561, in parse
    parser.feed(data)
  File "/usr/local/py3k/lib/python3.0/xml/etree/ElementTree.py", line 1201, in feed
    self._parser.Parse(data, 0)
  File "/usr/local/py3k/lib/python3.0/xml/etree/ElementTree.py", line 1157, in _default
    self._parser.ErrorColumnNumber)
xml.parsers.expat.ExpatError: undefined entity &eacute;: line 401, column 11
So, I tried to specify an additional parser via
if python_version >= 3:
    import htmlentitydefs

    class XmlFile(ElementTree.ElementTree):
        def __init__(self, file=None):
            ElementTree.ElementTree.__init__(self)
            parser = ElementTree.XMLTreeBuilder(
                target=ElementTree.TreeBuilder(ElementTree.Element))
            ent = htmlentitydefs.entitydefs
            for entity in ent:
                if entity not in parser.entity:
                    parser.entity[entity] = ent[entity]
            self.parse(source=file, parser=parser)
            return
The output was "wrong". For example, one of the tests I used was to
process a copy of the main dict of htmlentitydefs.py inside an html
page. A few of the characters came out OK, but I got things like:
'Α': 0x0391, # greek capital letter alpha, U+0391
When using my modified version, I got the following (which may not be
transmitted properly by email...)
'Α': 0x0391, # greek capital letter alpha, U+0391
It does look like a Greek capital letter alpha here.
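For readers on a current Python 3, the idea of registering HTML entity names with the parser can be sketched as follows. This is only an illustration under stated assumptions, not the modified htmlentitydefs.py described in this message: it assumes the module's later name, html.entities, and uses ElementTree.XMLParser in place of XMLTreeBuilder. The helper names build_entity_table and register_html_entities are made up for the example. Each entity name is mapped to the character itself, built from name2codepoint.

```python
from html.entities import name2codepoint
import xml.etree.ElementTree as ET


def build_entity_table():
    """Map every HTML entity name to the single character it stands for."""
    # name2codepoint maps e.g. "Alpha" -> 0x0391; chr() turns that into "Α".
    return {name: chr(codepoint) for name, codepoint in name2codepoint.items()}


def register_html_entities(parser):
    """Add HTML entity names to parser.entity, keeping any existing keys.

    Hypothetical helper: `parser` is assumed to expose a dict-like
    `entity` attribute, as ElementTree's XMLParser does.
    """
    for name, char in build_entity_table().items():
        parser.entity.setdefault(name, char)
    return parser


if __name__ == "__main__":
    parser = register_html_entities(ET.XMLParser())
    # Show the replacement text registered for one entity name.
    print(repr(parser.entity["Alpha"]))
```

Using setdefault mirrors the "if entity not in parser.entity" check in the code above, so entries already present on the parser are left alone.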
> > It does work for me ... but I don't know enough about unicode to be
> > sure that it is a proper bug, and not a quirk due to the way I wrote
> > my app.
>
> Without knowing what the actual problem is, it is hard to tell.
I hope the above is of some help.
Regards,
André
>
> Regards,
> Martin