processing XHTML1.1 documents with xml.sax

Uche Ogbuji uche at ogbuji.net
Mon Aug 9 13:53:55 EDT 2004


webworldL at yahoo.com wrote in message news:<mailman.1321.1091845385.5135.python-list at python.org>...
> Has anybody had any luck processing XHTML1.1 documents with xml.sax?
> Whenever I try it, python loads the W3C DTD from the top, then crashes
> saying that there's an error in the external DTD.
> All I need to do is rip through a bunch of XHTML documents and extract
> some data, does anybody know a quick way to do this without sax making
> outgoing network connections and fussing with DTDs?
> 
> BTW, the code to reproduce the error if anybody cares:
> below is a document 'hello.html' produced by the W3C's Amaya:
> 
> <?xml version="1.0" encoding="iso-8859-1"?>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
>       "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
>   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
>   <title>Hello  World</title>
>   <meta name="generator" content="amaya 8.5, see http://www.w3.org/Amaya/" />
> </head>
> 
> <body>
> <p>hello world!</p>
> </body>
> </html>
> 
> and the script:
> 
> import xml.sax
> import xml.sax.handler
> 
> xml.sax.parse("hello.html", xml.sax.handler.ContentHandler())
> 
> the error:
> 
> SAXParseException: http://www.w3.org/TR/xhtml-modularization/DTD/xhtml-framework-1.mod:89:0: error in processing external entity reference
> 
> will be thrown.

Ouch.  I took a brief look at this, and expat does have a problem here.
I should note that few DTDs are hairier stress tests of a parser's DTD
conformance than XHTMLMOD (the basis of XHTML 1.1).

Using the most recent expat, 1.95.8, something weird happens:

[uogbuji at borgia xmlwf]$ xmlwf -p ~/foo.xhtml
/home/uogbuji/http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd: No such file or directory
/home/uogbuji/foo.xhtml:3:52: error in processing external entity reference

It seems confused about the fact that http:// starts a URL: it treats
the system identifier as a path relative to my home directory.  I tried
as much fiddling as I had time for, but I think there's little recourse
other than for you to submit a bug report to the expat project:

http://sourceforge.net/tracker/?group_id=10127&atid=110127

In the meantime, change your DOCTYPE to use the XHTML 1.0 DTD (which
*does* work with expat) rather than 1.1.
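
If you would rather not touch the documents, another option may be to
tell the SAX parser not to load external entities at all, so the DTD is
never fetched.  The following is only a sketch, assuming a current
Python 3 with the stock expat driver.  The TitleHandler class, the
extract_title function and the embedded sample document are
illustrative names of mine, not anything from your code.  Entity
references that are only defined in the DTD (such as &nbsp;) would not
be expanded this way.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges

# Sample input modeled on the hello.html document quoted above.
XHTML = b"""<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
      "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Hello  World</title></head>
<body><p>hello world!</p></body>
</html>
"""


class TitleHandler(ContentHandler):
    """Hypothetical handler that collects the text of the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def startElement(self, name, attrs):
        if name == "title":
            self.in_title = True

    def endElement(self, name):
        if name == "title":
            self.in_title = False

    def characters(self, content):
        # characters() may be called several times per text node.
        if self.in_title:
            self.title += content


def extract_title(data):
    parser = xml.sax.make_parser()
    # Do not load external general entities, which includes the external
    # DTD subset, so no network connection is attempted.
    parser.setFeature(feature_external_ges, False)
    handler = TitleHandler()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return handler.title


title = extract_title(XHTML)
print(title)
```

The feature has to be set before parse() is called, which is why this
uses make_parser() instead of the xml.sax.parse() convenience function.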

Good luck.


-- 
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com