[XML-SIG] xml / html parsing for webbot

uche.ogbuji@fourthought.com uche.ogbuji@fourthought.com
Sun, 10 Dec 2000 06:32:03 -0700


> > Especially I found some pages which were generated by scripts, do
> > contain unmatched tags in the pages. How the two approaches handle
> > them?
> 
> For that purpose, the DOM authors made special support for HTML. You
> normally need a special parser, one that is capable of processing
> HTML, and still building a DOM tree. PyXML now includes 4DOM, which, I
> believe, is capable of converting arbitrary HTML into a DOM tree.

Correct as usual, Martin, although Python's standard htmllib gets much of the 
credit for wrangling unruly HTML.

Here's a little demo.  It shows how to read in any HTML and print out shiny 
XHTML.  Basically, it has the functionality of the highly popular Tidy 
(http://www.w3.org/People/Raggett/tidy/) or JTidy 
(http://lempinen.net/sami/jtidy/), but with XHTML output.  (It can easily be 
modified to produce cleaned HTML output.)

[uogbuji@borgia one-offs]$ cat html-to-xhtml-converter.py 
import sys
from xml.dom.ext.reader import HtmlLib
import xml.dom.ext

#set up a re-usable reader object
reader = HtmlLib.Reader()

#parse HTML from file or URI given on command line.  Return the DOM document
doc = reader.fromUri(sys.argv[1])

#Just for kicks, write it out as XHTML, i.e. all lowercase, XML syntax for
#empty tags, all attributes with given value, etc.

xml.dom.ext.XHtmlPrettyPrint(doc)

[uogbuji@borgia one-offs]$ cat data/example-from-wsdl-xslt-article.html 
<HTML>
  <HEAD>
    <TITLE>Service summary: EndorsementSearch</TITLE>
    <META charset='UTF-8' HTTP-EQUIV='content-type' CONTENT='text/html'>
  </HEAD>
  <BODY STYLE='background: #ffffff'>
    <H1>Service summary: EndorsementSearch</H1>
    <HR>
    <TABLE>
      <THEAD>Service: EndorsementSearchService</THEAD>
      <TBODY>
        <TR>
          <TD STYLE='background: #ccffff' COLSPAN='3'>
            <I>snowboarding-info.com Endorsement Service</I>
          </TD>
        </TR>
        <TR>
          <TD>Port: </TD>
          <TD STYLE='background: #ffccff'>http://www.snowboard-info.com/EndorsementSearch</TD>
          <TD STYLE='background: #ff66ff'>SOAP</TD>
        </TR>
      </TBODY>
    </TABLE>
  </BODY>
</HTML>
[uogbuji@borgia one-offs]$ python html-to-xhtml-converter.py data/example-from-wsdl-xslt-article.html
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
"DTD/xhtml1-strict.dtd">

<html xmlns = 'http://www.w3.org/1999/xhtml'>
 <head>
  <title/>Service summary: EndorsementSearch
  <meta charset='UTF-8' http-equiv='content-type' content='text/html'/>
 </head>
 <body style='background: #ffffff'>
  <h1>Service summary: EndorsementSearch</h1>
  <hr/>
  <table>
   <thead/>Service: EndorsementSearchService
   <tbody/>
   <tr>
    <td style='background: #ccffff' colspan='3'>
     <i>snowboarding-info.com Endorsement Service</i>
    </td>
   </tr>
   <tr>
    <td>Port:</td>
    <td style='background: #ffccff'>http://www.snowboard-info.com/EndorsementSearch</td>
    <td style='background: #ff66ff'>SOAP</td>
   </tr>
  </table>
 </body>
</html>
[uogbuji@borgia one-offs]$ 
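
On the original question about unmatched tags: if PyXML is not to hand, the 
general idea can be roughed out with the standard library alone.  The sketch 
below is not 4DOM's HtmlLib reader and builds no DOM tree; it uses Python 3's 
html.parser and simply re-serializes what it sees.  The names VOID, 
XhtmlWriter and to_xhtml are made up for this illustration.  It assumes that 
dropping comments and the doctype is acceptable, and it knows nothing about 
HTML's implied end tags, so an unclosed <li> or <td> is closed only when an 
enclosing element closes.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# HTML elements that never have content; XHTML writes them as <tag/>.
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}


class XhtmlWriter(HTMLParser):
    """Hypothetical helper: re-serialize tag-soup HTML as well-formed markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # output fragments, joined at the end
        self.stack = []  # names of the elements currently open

    def handle_starttag(self, tag, attrs):
        # html.parser has already lowercased tag and attribute names.
        # An attribute given without a value gets its own name as value.
        text = "".join(
            " %s=%s" % (name, quoteattr(name if value is None else value))
            for name, value in attrs
        )
        if tag in VOID:
            self.out.append("<%s%s/>" % (tag, text))
        else:
            self.out.append("<%s%s>" % (tag, text))
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray close tag with no matching open: drop it
        # Close everything left open inside the element being closed.
        while True:
            open_tag = self.stack.pop()
            self.out.append("</%s>" % open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))

    def close(self):
        super().close()
        # Close whatever is still open at end of input.
        while self.stack:
            self.out.append("</%s>" % self.stack.pop())


def to_xhtml(html):
    """Return a lowercase, well-formed serialization of an HTML string."""
    writer = XhtmlWriter()
    writer.feed(html)
    writer.close()
    return "".join(writer.out)


if __name__ == "__main__":
    print(to_xhtml("<P CLASS=x>one<BR><B>two</P>"))
    # prints: <p class="x">one<br/><b>two</b></p>
```

The open-element stack is what deals with the unmatched tags: a close tag 
first closes anything still open inside it, and a close tag that matches 
nothing is dropped.  A real tidying tool does far more than this.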


-- 
Uche Ogbuji                               Principal Consultant
uche.ogbuji@fourthought.com               +1 303 583 9900 x 101
Fourthought, Inc.                         http://Fourthought.com 
4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA
Software-engineering, knowledge-management, XML, CORBA, Linux, Python