Parsing HTML - modify URLs

Wed Jul 7 18:48:28 EDT 2004

michael at foord.net (Fuzzyman) writes:

> "Robert Brewer" <fumanchu at amor.org> wrote in message news:<mailman.69.1089211879.5135.python-list at python.org>...
> > Fuzzyman wrote:
> > > I am trying to parse an HTML page and only modify URLs within tags -
> > > e.g. inside IMG, A, SCRIPT, FRAME tags etc...
> > > 
> > > I have built one using HTMLParser.HTMLParser and it works
> > > fine.... on good HTML. Having done a Google search, it looks like
> > > parsing dodgy HTML and having HTMLParser choke is a common theme.

Use sgmllib instead (or htmllib, which adds a few bits and bobs on top
of sgmllib).  sgmllib.SGMLParser (and htmllib.HTMLParser) is more
robust than HTMLParser.HTMLParser.  OTOH, HTMLParser.HTMLParser is
more suitable for XHTML.

I remember that sorting out the precise differences between the two
libraries (htmllib and HTMLParser) was mildly painful and confusing,
so you might find it useful to look at ClientForm as an example,
because it can use both htmllib and HTMLParser modules.

> > Haven't used it, but Beautiful Soup sounds like it fits the bill:
> > 
> > http://www.crummy.com/software/BeautifulSoup/
> 
> It talks about 'walking the parse tree'... which is a bit more magic
> than I want... I just want to modify URLs in tags... which means I
> mainly want to extract the HTML unchanged and also modify a few tags -
> HTMLParser is quite good at this, but dies *horribly* on bad HTML... I
> may have to try Beautiful Soup though :-)

In general, Murphy has more shots at anything that both parses *and*
builds a tree, so sticking to just a parser (e.g. sgmllib) is
advantageous in that respect.  However, microdom is a tree-building
library that claims to be relatively tolerant of bad HTML.
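
To make the parser-only approach concrete, here is a minimal sketch of
a pass-through rewriter: it copies everything to the output as it
arrives and rebuilds only the start tags that carry a URL attribute.
It is an illustration, not code from this thread, and it rests on some
assumptions.  sgmllib and htmllib were removed in Python 3, so the
sketch uses html.parser.HTMLParser (the Python 3 name for
HTMLParser.HTMLParser) as a stand-in.  The URL_ATTRS table, the
URLRewriter class and the rewrite_urls helper are hypothetical names
made up for the example, and the table is far from complete.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical, incomplete table: which attributes hold URLs on which tags.
URL_ATTRS = {
    "a": {"href"},
    "img": {"src"},
    "script": {"src"},
    "frame": {"src"},
}


class URLRewriter(HTMLParser):
    """Copy HTML through, passing URL attributes to a rewrite callback."""

    def __init__(self, rewrite):
        # convert_charrefs=False keeps &amp; etc. as separate events,
        # so they can be written back out as they came in.
        super().__init__(convert_charrefs=False)
        self.rewrite = rewrite
        self.out = []

    def _start(self, tag, attrs, closing):
        wanted = URL_ATTRS.get(tag, ())
        if not any(n in wanted and v is not None for n, v in attrs):
            # Nothing to change: emit the tag exactly as it was written.
            self.out.append(self.get_starttag_text())
            return
        parts = [tag]
        for name, value in attrs:
            if value is None:            # bare attribute, e.g. <a name>
                parts.append(name)
                continue
            if name in wanted:
                value = self.rewrite(value)
            # The parser unescaped the value, so escape it again.
            parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + closing)

    def handle_starttag(self, tag, attrs):
        self._start(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._start(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def rewrite_urls(html, rewrite):
    """Return html with every known URL attribute passed through rewrite."""
    parser = URLRewriter(rewrite)
    parser.feed(html)
    parser.close()   # flush any text still buffered by the parser
    return "".join(parser.out)


if __name__ == "__main__":
    # Unquoted attribute and unclosed tags, to mimic sloppy markup.
    page = '<a href=page.html>link<p>unclosed <img src="pic.png">'
    print(rewrite_urls(page, lambda u: urljoin("http://example.com/", u)))
```

Because no tree is built, unclosed or badly nested tags are simply
copied through in the order they appear.  Only the start tags that
were rebuilt get normalised (lower-case names, double-quoted values);
end tags come out lower-cased as well.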

John