Bug in htmlentitydefs.py with Python 3.0?

André andre.roberge at gmail.com
Wed Dec 26 18:59:08 EST 2007


On Dec 26, 7:30 pm, "Martin v. Löwis" <mar... at v.loewis.de> wrote:
> André wrote:
> > In trying to parse html files using ElementTree running under Python
> > 3.0a1, and using htmlentitydefs.py to add "character entities" to the
> > parser, I found that I needed to create a customized version of
> > htmlentitydefs.py to make things work properly.
>
> Can you please state what precise problem you were seeing? The original
> code looks fine to me as it stands.
>

As stated above, I was using ElementTree to parse an HTML file and
sending the output to a browser.
Without an additional parser, I was getting the following error
message:

Traceback (most recent call last):
  File "/Users/andre/CrunchySVN/branches/andre/src/http_serve.py", line 79, in do_POST
    self.server.get_handler(realpath)(self)
  File "src/plugins/handle_default.py", line 53, in handler
    data = path_to_filedata(request.path, root_path)
  File "src/plugins/handle_default.py", line 39, in path_to_filedata
    return cp.create_vlam_page(open(npath), path).read()
  File "/Users/andre/CrunchySVN/branches/andre/src/CrunchyPlugin.py", line 98, in create_vlam_page
    return vlam.CrunchyPage(filehandle, url, remote=remote, local=local)
  File "/Users/andre/CrunchySVN/branches/andre/src/vlam.py", line 62, in __init__
    self.tree = parse(filehandle)#XmlFile(filehandle)
  File "/usr/local/py3k/lib/python3.0/xml/etree/ElementTree.py", line 823, in parse
    tree.parse(source, parser)
  File "/usr/local/py3k/lib/python3.0/xml/etree/ElementTree.py", line 561, in parse
    parser.feed(data)
  File "/usr/local/py3k/lib/python3.0/xml/etree/ElementTree.py", line 1201, in feed
    self._parser.Parse(data, 0)
  File "/usr/local/py3k/lib/python3.0/xml/etree/ElementTree.py", line 1157, in _default
    self._parser.ErrorColumnNumber)
xml.parsers.expat.ExpatError: undefined entity &eacute;: line 401, column 11
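
The same kind of failure is easy to reproduce on a current Python 3. The
sketch below is only an illustration and not part of the original report:
it assumes today's xml.etree.ElementTree, where the exception surfaces as
ParseError rather than ExpatError, and the helper name is made up.

```python
import xml.etree.ElementTree as ET


def undefined_entity_error(xml_text):
    """Return the parser's error message, or None if the text parses cleanly."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return str(exc)
    return None


# A named HTML entity is unknown to a plain XML parser...
print(undefined_entity_error("<p>caf&eacute;</p>"))
# ...whereas a numeric character reference is always accepted.
print(undefined_entity_error("<p>caf&#233;</p>"))
```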

So I tried to specify an additional parser, like this:

if python_version >= 3:
    import htmlentitydefs

    class XmlFile(ElementTree.ElementTree):
        """ElementTree that also knows the named HTML entities."""
        def __init__(self, file=None):
            ElementTree.ElementTree.__init__(self)
            parser = ElementTree.XMLTreeBuilder(
                target=ElementTree.TreeBuilder(ElementTree.Element))
            # Register every HTML entity the parser does not already know.
            ent = htmlentitydefs.entitydefs
            for entity in ent:
                if entity not in parser.entity:
                    parser.entity[entity] = ent[entity]
            self.parse(source=file, parser=parser)

The output was "wrong".  For example, one of the tests I used was to
process a copy of the main dict of htmlentitydefs.py inside an HTML
page.  A few of the characters came out OK, but I got things like:

'Α':    0x0391, # greek capital letter alpha, U+0391

When using my modified version, I got the following (which may not be
transmitted properly by email...)
    'Α':    0x0391, # greek capital letter alpha, U+0391

It does look like a Greek capital letter alpha here.
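
For comparison, here is a minimal sketch of the same idea on a current
Python 3. It is not the modified htmlentitydefs.py mentioned above. It
assumes the html.entities module (the later home of htmlentitydefs) and
ElementTree's XMLParser, and it maps each entity name to the character
itself via name2codepoint rather than to a character reference. The helper
names and the sample document are hypothetical. The sample carries a
DOCTYPE on purpose: as far as I understand, expat only passes unknown
entities on to the parser's entity table when the document declares an
external DTD.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint


def build_entity_map():
    """Map each HTML entity name to the character it stands for."""
    return {name: chr(codepoint) for name, codepoint in name2codepoint.items()}


def parse_xhtml(text):
    """Parse XHTML text with the named HTML entities predefined."""
    parser = ET.XMLParser()
    # parser.entity is the lookup table consulted for undefined entities.
    parser.entity.update(build_entity_map())
    return ET.fromstring(text, parser=parser)


# Hypothetical sample document, for illustration only.
SAMPLE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    "<html><body><p>caf&eacute; &Alpha;</p></body></html>"
)

root = parse_xhtml(SAMPLE)
print(root.find("body/p").text)
```

Storing real characters rather than strings such as "&#913;" matters
because ElementTree treats the replacement as plain text and would escape
the ampersand again on output.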


> > It does work for me ... but I don't know enough about unicode to be
> > sure that it is a proper bug, and not a quirk due to the way I wrote
> > my app.
>
> Without knowing what the actual problem is, it is hard to tell.

I hope the above is of some help.

Regards,

André


>
> Regards,
> Martin



