Bug in htmlentitydefs.py with Python 3.0?

Wed Dec 26 17:37:31 EST 2007

While trying to parse HTML files using ElementTree under Python
3.0a1, with htmlentitydefs.py supplying the "character entities" for
the parser, I found that I needed to create a customized version of
htmlentitydefs.py to make things work properly.

The change needed was to replace (at the bottom of the file)
====
for (name, codepoint) in name2codepoint.items():
    codepoint2name[codepoint] = name
    if codepoint <= 0xff:
        entitydefs[name] = chr(codepoint)
    else:
        entitydefs[name] = '&#%d;' % codepoint
====
with
----
for (name, codepoint) in name2codepoint.items():
    codepoint2name[codepoint] = name
    entitydefs[name] = chr(codepoint)
----

This works for me, but I don't know enough about Unicode to be sure
that it is a genuine bug rather than a quirk of the way I wrote my
app.  So I thought it would be appropriate to post it here, so that
Unicode experts can determine whether it is indeed a bug and, if so,
file a bug report or write a patch.  The same code is present in
Python 3.0a2, but I have not tested it under that version.
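
To make the difference concrete, here is a minimal sketch comparing the
two loops on a tiny hand-picked table.  The function names
build_original and build_patched are made up for this illustration and
the two-entry table stands in for the full name2codepoint dictionary;
the codepoint2name bookkeeping is left out because it is the same in
both versions.  My guess (only a guess) is that the "<= 0xff" branch is
a leftover from Python 2, where chr() only accepts values 0-255, whereas
in Python 3 chr() accepts any Unicode code point.

```python
# Tiny stand-in for the real name2codepoint table (assumption: two
# entries are enough to show both branches of the original loop).
name2codepoint = {"eacute": 0xE9, "alpha": 0x3B1}


def build_original(table):
    """Mimic the loop as shipped: numeric reference above 0xff."""
    defs = {}
    for name, codepoint in table.items():
        if codepoint <= 0xff:
            defs[name] = chr(codepoint)
        else:
            defs[name] = '&#%d;' % codepoint
    return defs


def build_patched(table):
    """Mimic the modified loop: always map to the character itself."""
    defs = {}
    for name, codepoint in table.items():
        defs[name] = chr(codepoint)
    return defs


original = build_original(name2codepoint)
patched = build_patched(name2codepoint)

# Both versions agree for code points up to 0xff ...
print(repr(original["eacute"]), repr(patched["eacute"]))
# ... but differ above it: a numeric reference versus the character.
print(repr(original["alpha"]), repr(patched["alpha"]))
```

With the original loop, any consumer that inserts the entitydefs value
as-is gets the literal text "&#945;" for "alpha" instead of the
character itself, unless it resolves the numeric reference in a second
pass.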

André