[XML-SIG] xml.dom.ext.reader.HtmlLib

Lars Marius Garshol larsga@garshol.priv.no
18 Jul 2001 11:12:01 +0200


* Lars Marius Garshol
|
| Part of the problem here is that we have a separate Reader for HTML
| documents. IMHO it would be much preferable to have a SAX driver for
| the HTML parser instead. That could then use the SAX Reader, and
| behaviour would be consistent. 
| 
| In addition, we would get increased flexibility by having a SAX driver
| for this parser.

* Martin v. Loewis
| 
| Sounds like an interesting project for a volunteer. 

I guess it would be. It's a very small task, really, but good for
learning. I would do it, but I haven't got the time.
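
To make the idea concrete, here is a minimal sketch of such a driver.
It assumes Python's standard html.parser module as the underlying
parser, which is a stand-in and not the parser used by HtmlLib; the
names HtmlSaxDriver, _Bridge and EventCollector are made up for
illustration. It only forwards the events the parser reports and does
nothing to repair missing end tags.

```python
import io
from html.parser import HTMLParser
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import prepare_input_source
from xml.sax.xmlreader import AttributesImpl, XMLReader


class _Bridge(HTMLParser):
    """Hypothetical helper: turns HTMLParser callbacks into SAX events."""

    def __init__(self, content_handler):
        super().__init__(convert_charrefs=True)
        self._handler = content_handler

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. "checked") arrive with value None;
        # this sketch reports the attribute name as its value.
        attr_map = {k: (v if v is not None else k) for k, v in attrs}
        self._handler.startElement(tag, AttributesImpl(attr_map))

    def handle_endtag(self, tag):
        self._handler.endElement(tag)

    def handle_data(self, data):
        self._handler.characters(data)


class HtmlSaxDriver(XMLReader):
    """Hypothetical SAX driver wrapping html.parser (an assumption)."""

    def parse(self, source):
        source = prepare_input_source(source)
        stream = source.getCharacterStream()
        if stream is not None:
            text = stream.read()
        else:
            raw = source.getByteStream().read()
            text = raw.decode(source.getEncoding() or "utf-8", "replace")
        handler = self.getContentHandler()
        bridge = _Bridge(handler)
        handler.startDocument()
        bridge.feed(text)
        bridge.close()  # flush any buffered character data
        handler.endDocument()


class EventCollector(ContentHandler):
    """Records events so the driver's output can be inspected."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("chars", content))


if __name__ == "__main__":
    driver = HtmlSaxDriver()
    collector = EventCollector()
    driver.setContentHandler(collector)
    driver.parse(io.StringIO('<p class="x">Hi <b>there</b></p>'))
    for event in collector.events:
        print(event)
```

Because the driver is an ordinary XMLReader, any ContentHandler can be
plugged in with setContentHandler(), which is the consistency argued
for above.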

| I'd personally recommend building this SAX driver on top of sgmlop;
| the true challenge is getting right the events that result only from
| the SGML DTD for HTML (e.g. missing closing tags).
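
As a rough illustration of what "getting the events right" involves,
the sketch below synthesizes end events for omitted closing tags. It
assumes Python's standard html.parser module rather than sgmlop, and
the class name and the two small tag tables are hypothetical and far
from complete; the real rules come from the HTML DTD.

```python
from html.parser import HTMLParser

# Hypothetical, deliberately tiny tables; the HTML DTD defines many more.
VOID_ELEMENTS = {"br", "hr", "img", "input", "meta", "link"}
CLOSES_SAME_NAME = {"p", "li", "td", "tr"}


class ImpliedEndTagParser(HTMLParser):
    """Records balanced start/end events, adding implied end tags."""

    def __init__(self):
        super().__init__()
        self.stack = []   # names of currently open elements
        self.events = []  # ("start" | "end", tag) pairs

    def handle_starttag(self, tag, attrs):
        # A new <li> directly inside an open <li> implies </li>, etc.
        if tag in CLOSES_SAME_NAME and self.stack and self.stack[-1] == tag:
            self.events.append(("end", self.stack.pop()))
        self.events.append(("start", tag))
        if tag in VOID_ELEMENTS:
            # Void elements never have content, so close them at once.
            self.events.append(("end", tag))
        else:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: ignore it
        # Close everything still open inside the element being ended.
        while self.stack:
            top = self.stack.pop()
            self.events.append(("end", top))
            if top == tag:
                break

    def close(self):
        super().close()
        # End of input closes whatever is still open.
        while self.stack:
            self.events.append(("end", self.stack.pop()))


if __name__ == "__main__":
    parser = ImpliedEndTagParser()
    parser.feed("<ul><li>one<li>two</ul><p>a<br>b")
    parser.close()
    for event in parser.events:
        print(event)
```

Even this toy version needs per-element rules, which hints at how much
DTD knowledge a complete driver would have to carry.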

So perhaps it would be better to integrate Tidy as a Python module?
It's a lot more work, but it would also be a lot more useful. If that
were done I think the module should have SAX as its interface. 

I think using the native expat interface was a mistake that has caused
us all kinds of problems. Instead of having just one interface for
parsers we ended up with several, because many people didn't want to
take the (slight) performance hit of using SAX.

So a SAX driver for expat written in C would be another good thing.

--Lars M.