Subject changed: Bug in combining htmlentitydefs.py and ElementTree?

Wed Dec 26 22:36:29 EST 2007

Sorry for the top posting - I found out that the problem I encountered
was not something new in Python 3.0.

Here's a test program:
============
import htmlentitydefs
from xml.etree import ElementTree

class XmlParser(ElementTree.ElementTree):
    def __init__(self, file=None):
        ElementTree.ElementTree.__init__(self)
        parser = ElementTree.XMLTreeBuilder(
            target=ElementTree.TreeBuilder(ElementTree.Element))
        # Replacement text used for entities that expat reports as undefined
        parser.entity = htmlentitydefs.entitydefs
        self.parse(source=file, parser=parser)

f = open('test.html')
tree = XmlParser(f)
f.close()
tree.write('test_out.html', encoding='utf-8')
======
This program should be run with the following test file (test.html):
=====
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<title>test</title>
</head>
<body>
<p>&Alpha;</p>
</body>
</html>
======
If run as such, it will write the following to test_out.html:
------
<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
<title>test</title>
</head>
<body>
<p>&amp;#913;</p>
</body>
</html>
-------
Notice how it is &amp;#913; that appears instead of &Alpha;
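
One possible explanation, sketched below in Python 3 syntax: in Python 2.5,
htmlentitydefs.entitydefs stores the string '&#913;' for 'Alpha' (only
code points up to 0xff are stored as a real character), and ElementTree
treats whatever it finds in parser.entity as plain text, so the "&" gets
escaped on output.  The element and variable names here are made up for
illustration; this is not taken from the test program above.

```python
import xml.etree.ElementTree as ET

# The replacement text Python 2.5's entitydefs holds for 'Alpha':
# a character reference spelled out as an ordinary string.
replacement = '&#%d;' % 0x0391

p = ET.Element('p')

# Text data is escaped when serialized, so the "&" becomes "&amp;".
p.text = replacement
escaped = ET.tostring(p, encoding='unicode')

# Storing the character itself avoids the double escaping.
p.text = chr(0x0391)
literal = ET.tostring(p, encoding='unicode')

print(escaped)
print(literal)
```

If this is what is going on, the entity table would need to hold the
characters themselves rather than character references.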

This is the behaviour with both Python3.0 and 2.5.  (When I was
running with Python 2.5, I was always preprocessing the files with
BeautifulSoup, which removed many problems).

If I use "my_htmlentitiesdef.py" described in a previous message, I do
get an Alpha printed out (admittedly, not the character entity).

I would prefer to find a way to process such files and get &Alpha;
instead...  (or even, to process files with hard-coded characters e.g.
é instead of &eacute; and have them processed properly...).
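
A minimal sketch of one way to get at least the real character through,
assuming a current Python 3 where htmlentitydefs is named html.entities and
the parser class is xml.etree.ElementTree.XMLParser.  The sample document
and the function name parse_with_html_entities are hypothetical.  Each
entity name is mapped to the character itself (via name2codepoint) instead
of to a character reference string.  This yields the character, not the
&Alpha; entity, in the output.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Hypothetical sample input. The DOCTYPE matters: without a declared DTD,
# expat rejects an undefined entity before ElementTree can look it up.
DOC = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    '<html><body><p>&Alpha; Andr&eacute;</p></body></html>'
)

def parse_with_html_entities(text):
    """Parse XHTML text, resolving HTML entity names to real characters."""
    parser = ET.XMLParser()
    # Map every known entity name to the character, never to '&#NNN;'.
    parser.entity.update(
        (name, chr(codepoint)) for name, codepoint in name2codepoint.items()
    )
    return ET.fromstring(text, parser=parser)

root = parse_with_html_entities(DOC)
paragraph_text = root.find('body/p').text
serialized = ET.tostring(root, encoding='unicode')
print(serialized)
```

Files that already contain the literal é need no entity table at all, as
long as they are read with the encoding they were saved in.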

unicode-challengedly-yrs,

André

On Dec 26, 9:14 pm, "André" <andre.robe... at gmail.com> wrote:
> On Dec 26, 8:53 pm, "Martin v. Löwis" <mar... at v.loewis.de> wrote:
>
> > > Without an additional parser, I was getting the following error
> > > message:
> > [...]
> > > xml.parsers.expat.ExpatError: undefined entity é: line 401, column 11
>
> > To understand that problem better, it would have been helpful to see
> > what line 401, column 11 of the input file actually says. AFAICT,
> > it must have been something like "&é;" which would be really puzzling
> > to have in an XML file (usually, people restrict themselves to ASCII
> > for entity names).
>
> No, that one was &eacute;   (testing with my own name that appeared in
> a file).
>
>
>
> > >             for entity in ent:
> > >                 if entity not in parser.entity:
> > >                     parser.entity[entity] = ent[entity]
>
> > This looks fine to me.
>
> > > The output was "wrong".  For example, one of the tests I used was to
> > > process a copy of the main dict of htmlentitydefs.py inside an html page.  A
> > > few of the characters came ok, but I got things like:
>
> > > '&#913;':    0x0391, # greek capital letter alpha, U+0391
>
> > Why do you think this is wrong?
>
> Sorry, that was just cut-and-pasted from the browser (not the source);
> in the source of the processed html page, it is
> '&amp;#913;':    0x0391, # greek capital letter alpha, U+0391
>
> i.e.  the "&" was transformed into "&amp;" in a number of places (all
> places above ascii 127 I believe).
>
> Here are a few more lines extracted from the html file that was
> processed:
> =============
>     'Â':    0x00c2, # latin capital letter A with circumflex, U+00C2
> ISOlat1
>     'À':   0x00c0, # latin capital letter A with grave = latin capital
> letter A grave, U+00C0 ISOlat1
>     '&amp;#913;':    0x0391, # greek capital letter alpha, U+0391
>     'Å':    0x00c5, # latin capital letter A with ring above = latin
> capital letter A ring, U+00C5 ISOlat1
>     'Ã':   0x00c3, # latin capital letter A with tilde, U+00C3 ISOlat1
>     'Ä':     0x00c4, # latin capital letter A with diaeresis, U+00C4
> ISOlat1
>     '&amp;#914;':     0x0392, # greek capital letter beta, U+0392
>     'Ç':   0x00c7, # latin capital letter C with cedilla, U+00C7
> ISOlat1
>     '&amp;#935;':      0x03a7, # greek capital letter chi, U+03A7
>     '&amp;#8225;':   0x2021, # double dagger, U+2021 ISOpub
>     '&amp;#916;':    0x0394, # greek capital letter delta, U+0394
> ISOgrk3
>   ============
>
>
>
> > > When using my modified version, I got the following (which may not be
> > > transmitted properly by email...)
> > >     'Α':    0x0391, # greek capital letter alpha, U+0391
>
> > > It does look like a Greek capital letter alpha here.
>
> > Sure, however, your first version ALSO has the Greek capital letter
> > alpha there; it is just spelled as &#913; (which *is* a valid spelling
> > for that letter in XML).
>
> Agreed that it would be... However that was not how it was
> transformed, see above; sorry if I was not clear about what was
> happening  (I should not have cut-and-pasted from the browser window).
>
>
>
> > > I hope the above is of some help.
>
> > Thanks; I now think that htmlentitydefs is just as fine as it always
> > was - I don't see any problem here.
>
> You may well be right in that the problem may lie elsewhere.  But as
> making the change I mentioned "fixed" the problem at my end, I figured
> this was where the problem was located - and thought I should at least
> report it here.
>
> Regards,
> André
>
> > Regards,
> > Martin