Extracting data from HTML

Tue Jun 4 22:31:51 EDT 2002

Ian Bicking writes:
> I think you are misinterpreting Geoff's response, and you seem to
> have a chip on your shoulder about it.

Sorry about the attitude.  I'm frustrated that, twelve years into the
World Wide Web, there are still so many people who fail to learn from
its lessons.

> He did not compare XML-RPC to HTTP, but to HTML (at least, that's
> clearly implicit because this thread was talking about HTML
> parsing).  HTML is clearly a poor way to exchange machine-readable
> information, there are too many layout-related tags that are usually
> only appreciated by humans.

I don't really agree that HTML is a poor way to exchange
machine-readable information; that is, after all, what the language is
designed for.  But if the author of the HTML doesn't have
machine-readability in mind, and many don't, the HTML usually won't be
very machine-readable.

Nevertheless, Geoff said "XML-RPC or some other protocol that's
designed to carry data".  HTTP and XML-RPC are protocols, although
they each define data formats; HTML is a language.

**

With regard to your earlier question about htmllib and HTML::Parse: I
hadn't actually tried htmllib; I had only read that it was unreliable
on malformed HTML.  So far, though, I haven't been able to feed it
HTML malformed enough to break it.

Here are link-extraction scripts in Perl using standard CPAN libraries
and in Python using standard Python libraries.  The Perl script is
more featureful.  The Python script was more painful to write, partly
because htmllib.HTMLParser has a more poorly designed interface than
HTML::Parser (hard as that may be to believe), but mostly because
HTML::LinkExtor already exists and does most of the work for the Perl
script.

Perl version:

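#!/usr/bin/perl -w
use strict;
require HTML::LinkExtor;
# The second argument is a base URL; relative links are made absolute
# against it.
my $p = HTML::LinkExtor->new(\&cb, "http://www.sn.no/");
# Called once per link-bearing tag, with the tag name followed by its
# attribute/URL pairs.
sub cb {
    my($tag, %links) = @_;
    print "$tag @{[%links]}\n";
}
$p->parse_file("pathological.html");
__END__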

Python version:

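#!/usr/bin/python
import htmllib, formatter
class x(htmllib.HTMLParser):
    # Print the tag name followed by any link-bearing attributes.
    def dump(self, tag, attrs):
        print tag,
        for a, v in attrs:
            if a in ['action', 'src', 'href']:
                print a, v,
        print
    # The parser calls do_<tag> for tags with no end tag and
    # start_<tag> for tags that have one.
    def do_img(self, attrs):
        self.dump('img', attrs)
    def start_a(self, attrs):
        self.dump('a', attrs)
    def start_form(self, attrs):
        self.dump('form', attrs)

# HTMLParser requires a formatter; NullFormatter discards the output.
y = x(formatter.NullFormatter())
y.feed(open('pathological.html').read())
y.close()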
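
htmllib was removed in Python 3, so the script above only runs under
Python 2.  A rough equivalent for Python 3 can be sketched with
html.parser.HTMLParser from the standard library.  This is only a
sketch, not a drop-in replacement: the names LinkExtractor,
extract_links and LINK_ATTRS are made up for illustration, and the
optional base argument is an assumption added to mimic the base URL
that the Perl script passes to HTML::LinkExtor.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Attributes treated as links, same set as the htmllib script above.
LINK_ATTRS = ('action', 'src', 'href')


class LinkExtractor(HTMLParser):
    """Collect (tag, attribute, url) triples for img, a and form tags."""

    def __init__(self, base=None):
        super().__init__()
        self.base = base      # hypothetical: optional base URL
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling us.
        if tag not in ('img', 'a', 'form'):
            return
        for name, value in attrs:
            if name in LINK_ATTRS and value is not None:
                if self.base is not None:
                    # Resolve relative URLs against the base URL.
                    value = urljoin(self.base, value)
                self.links.append((tag, name, value))


def extract_links(html, base=None):
    """Parse an HTML string and return the list of link triples."""
    parser = LinkExtractor(base)
    parser.feed(html)
    parser.close()
    return parser.links
```

Something like extract_links(open('pathological.html').read()) would
then return the triples as a list, which can be printed one per line
to get output similar in spirit to the scripts above.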

Here's pathological.html, which I guess is not very pathological,
because it didn't break either script.  I'd be very interested to see
HTML pathological enough to break one or the other, but still accepted
by Netscape >3.0 or MSIE >4.0.

<ul>
Moroon & I are --&gt<b><B><i>GETTING MARRIED</b></i><-- next year.
<p> Look at my <a href=http://example.com/~moron>website.</a>
<body bgcolor=#ff7777>
<script>
This stuff shouldn't get displayed.
x = 1; y = 3;
if (x<y) x = y;
</script>
<IMG SRC='MYPIC.GIF'/>LOOK AT ME!
Sign my guestbook: <table><tr><td><form action=guestbook.cgi method=post>
<input name="yourname"><td><input type="submit"></form></table>

-- 
<kragen at pobox.com>       Kragen Sitaker     <http://www.pobox.com/~kragen/>
The Internet stock bubble didn't burst on 1999-11-08.  Hurrah!
<URL:http://www.pobox.com/~kragen/bubble.html>
The power didn't go out on 2000-01-01 either.  :)