Parsing HTML - modify URLs

richard richardjones at optushome.com.au
Wed Jul 7 19:04:25 EDT 2004


> michael at foord.net (Fuzzyman) writes:
>> "Robert Brewer" <fumanchu at amor.org> wrote in message
>> news:<mailman.69.1089211879.5135.python-list at python.org>...
>> > Haven't used it, but Beautiful Soup sounds like it fits the bill:
>> > 
>> > http://www.crummy.com/software/BeautifulSoup/
>> 
>> It talks about 'walking the parse tree'... which is a bit more magic
>> than I want... I just want to modify URLs in tags... which means I
>> mainly want to extract the HTML unchanged and also modify a few tags -
>> HTMLParser is quite good at this- but dies *horribly* at bad HTML... I
>> may have to try beautiful soup though :-)

From the BeautifulSoup page:

 "You can modify a Tag or NavigableText in place. Printing it out as a
  string will print the new markup text."

And really, it handles *any* HTML, no matter how crappy - I'm using it to
deal with pages that have random <span> and </span> in them with no
matching end / start tags. Eugh.

Once you've written rewrite_url(), this will do the job on the BeautifulSoup
side:

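  from BeautifulSoup import BeautifulSoup

  soup = BeautifulSoup()
  soup.feed(source_html)
  # rewrite the URL-carrying attribute of each tag type we care about
  for tag_name, attr in (('img', 'src'), ('a', 'href')):
    # calling the soup with a tag name returns every matching tag
    for tag in soup(tag_name):
      if tag.get(attr):
        tag[attr] = rewrite_url(tag[attr])
  # printing the soup emits the modified markup
  print soup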
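
If it helps, here is one possible rewrite_url(), as a rough sketch only. It
is written for present-day Python 3 (urllib.parse) rather than the Python 2
used above, and it assumes that "modify URLs" means making relative links
absolute against the address of the page. BASE_URL is a made-up placeholder,
not something from this thread; substitute whatever rewriting you need.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical address of the page being processed (an assumption for
# illustration only).
BASE_URL = "http://www.example.com/docs/index.html"


def rewrite_url(url, base=BASE_URL):
    """Return url resolved against base, leaving non-HTTP links alone."""
    url = url.strip()
    scheme = urlsplit(url).scheme
    # Pure fragments (#top) and schemes such as mailto: or javascript:
    # are returned unchanged.
    if url.startswith("#") or (scheme and scheme not in ("http", "https")):
        return url
    # urljoin resolves relative paths and passes absolute URLs through.
    return urljoin(base, url)
```

The loop above already skips tags whose attribute is missing or empty, so
this function only ever sees non-empty strings.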


    Richard



