Help with Parsing HTML files

Charlie Clark charlie at begeistert.org
Thu Aug 2 15:17:25 EDT 2001


As part of a prototype I need to be able to plug in several different
content websites and pull headlines from them to put on my own website
by way of a database. I know that the normal way of doing this is to
subscribe to some XML-based format, but that isn't possible at the
moment as the streams would be too expensive (around Euro 5000 per
stream per month). We have a couple of Visual Basic scripts doing this
at the moment but I have suggested moving away from them: the scripts
are not easily reusable or extensible; Python would give us platform
independence and moving from VB + MS SQL to Python + PostgreSQL or
similar has a certain commercial logic.

Scenario: a web page provides content (x articles on the page, all in
the same format). There are no handy comment tags in the source
differentiating the various parts of interest.

What's the best way to go about parsing the HTML? I've looked at sgmllib
and htmllib and am a bit lost. The worst thing for me about Python's
documentation is its lack of examples. I leafed through all the Python
books in the bookshop today but failed to find much inspiration. One of
the problems I'll admit to having is not being able to work out how to
use a class simply by reading its code - it just doesn't work for me
:-((

I see the following alternatives:

1) extend and improve on treating the source as plain-text. Making use
of regular expressions might be useful here.
2) use a library module to parse the html-source and get it to release
the appropriate objects
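
To make option 1 concrete, here is a minimal sketch of the plain-text
approach with one regular expression per source format. The sample
markup, the field names (date, title, body) and the function name
extract_articles are all assumptions made up for illustration; real
pages will have attributes and whitespace that the pattern would need
to allow for.

```python
import re

# Hypothetical page source: two articles in the same row format.
SAMPLE = (
    "<table>"
    "<tr><td><img>2 Aug 2001</td>"
    "<td><font>First title</font><br><br>First article</td></tr>"
    "<tr><td><img>1 Aug 2001</td>"
    "<td><font>Second title</font><br><br>Second article</td></tr>"
    "</table>"
)

# One pattern per source format; the named groups are the fields to store.
ROW = re.compile(
    r"<td><img>(?P<date>.*?)</td>\s*"
    r"<td><font>(?P<title>.*?)</font><br><br>(?P<body>.*?)</td>",
    re.DOTALL | re.IGNORECASE,
)

def extract_articles(source):
    """Return one dict per article found in the page source."""
    return [m.groupdict() for m in ROW.finditer(source)]

articles = extract_articles(SAMPLE)
```

Supporting a new site would then mean writing a new pattern rather than
new code, though patterns like this tend to break when the site changes
its layout.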

I'd really like to be able to have a system which could easily be
trained to deal with new source formats on a kind of template basis.

Here's a made-up example source:

<body>
...
<table>
<tr>
<td><img>Date</td>
<td><font>title</font><br><br>Article</td>
</tr>
... continues with the rest of the articles
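
For option 2, a rough sketch of what a parser-based approach might look
like on the made-up source above. It uses html.parser.HTMLParser, the
standard-library module in current Python that fills the role of sgmllib
and htmllib (both of which were dropped in Python 3), so this is a
swapped-in module rather than the ones named in this post. The class
name RowCollector and its attributes are hypothetical, and the sketch
assumes one article per table row with no nested tables.

```python
from html.parser import HTMLParser

class RowCollector(HTMLParser):
    """Collect the text of each <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cells
        self._row = None    # cells of the row currently being read
        self._cell = None   # text chunks of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep only non-blank text that appears inside a cell.
        if self._cell is not None and data.strip():
            self._cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append(self._cell)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

SAMPLE = """
<body>
<table>
<tr>
<td><img>Date</td>
<td><font>title</font><br><br>Article</td>
</tr>
</table>
</body>
"""

parser = RowCollector()
parser.feed(SAMPLE)
parser.close()
```

After feeding the page, parser.rows holds one entry per row, and each
cell is a list of its text chunks, so the title and the article text
come out as separate items of the second cell.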

I'm currently analysing the source and working out ways to separate
articles from each other and then read individual articles. As I'm
having to read the source into a single string I can see sgmllib and
htmllib calling; I just don't know what they are saying to me, so at the
moment it's a question of:

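def split_articles(source, markers):
    # markers[0] opens an article, markers[-1] closes it
    pos = 0
    while True:
        start = source.find(markers[0], pos)
        if start == -1:
            break  # no more articles
        stop = source.find(markers[-1], start)
        if stop == -1:
            break  # unterminated article
        stop += len(markers[-1])  # step past the closing marker
        article = source[start:stop]
        # pulls out the contents and writes them into the database
        do_something_with_article(article, markers=markers)
        pos = stop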

Would it be possible to take this source and mark it up as a template
which would in turn generate markers for automated parsing? So
<td><img>Date</td>
would become
<!-- date_start --><td><img>Date</td><!-- date_end -->
Then a program could learn how to parse a new page based on the template
and happily go about doing it. This would separate templating from
programming and be useful in itself.
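
A minimal sketch of how such a template might be turned into a parser
automatically: the marked-up template is converted into a regular
expression in which the literal markup is escaped and each marked span
becomes a named group. The function name template_to_regex and the field
names are hypothetical, and the sketch assumes the markers are placed
around the variable text only (inside the tags), not around the whole
cell as in the example above, so that the surrounding tags can act as
anchors.

```python
import re

# Matches <!-- name_start --> ... <!-- name_end --> and captures "name".
MARKED = re.compile(r"<!-- (\w+)_start -->.*?<!-- \1_end -->", re.DOTALL)

def template_to_regex(template):
    """Turn a marked-up template into a compiled regular expression."""
    parts = []
    pos = 0
    for m in MARKED.finditer(template):
        parts.append(re.escape(template[pos:m.start()]))  # literal markup
        parts.append("(?P<%s>.*?)" % m.group(1))          # marked field
        pos = m.end()
    parts.append(re.escape(template[pos:]))               # trailing markup
    return re.compile("".join(parts), re.DOTALL)

# Hypothetical template for one article row.
TEMPLATE = (
    "<tr><td><img><!-- date_start -->Date<!-- date_end --></td>"
    "<td><font><!-- title_start -->title<!-- title_end --></font><br><br>"
    "<!-- body_start -->Article<!-- body_end --></td></tr>"
)
# Hypothetical live page in the same format.
PAGE = (
    "<table>"
    "<tr><td><img>2 Aug 2001</td>"
    "<td><font>Headline</font><br><br>Some text</td></tr>"
    "</table>"
)

pattern = template_to_regex(TEMPLATE)
articles = [m.groupdict() for m in pattern.finditer(PAGE)]
```

The template stays an ordinary HTML file that a non-programmer could
mark up, but the match is exact: any difference in whitespace or
attributes between template and live page would stop it matching.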

Would this be a good idea? How would I go about doing this "properly"
using the modules?

Many thanx for any help and pointers.

Charlie


