Parsing HTML - modify URLs

Fuzzyman michael at foord.net
Thu Jul 8 03:16:13 EDT 2004


richard <richardjones at optushome.com.au> wrote in message news:<40ec817a$0$25460$afc38c87 at news.optusnet.com.au>...
> > michael at foord.net (Fuzzyman) writes:
> >> "Robert Brewer" <fumanchu at amor.org> wrote in message
> >> news:<mailman.69.1089211879.5135.python-list at python.org>...
> >> > Haven't used it, but Beautiful Soup sounds like it fits the bill:
> >> > 
> >> > http://www.crummy.com/software/BeautifulSoup/
> >> 
> >> It talks about 'walking the parse tree'... which is a bit more magic
> >> than I want... I just want to modify URLs in tags... which means I
> >> mainly want to extract the HTML unchanged and also modify a few tags -
> >> HTMLParser is quite good at this - but dies *horribly* on bad HTML... I
> >> may have to try Beautiful Soup though :-)
> 
> From the BeautifulSoup page:
> 
>  "You can modify a Tag or NavigableText in place. Printing it out as a
>   string will print the new markup text."
> 
> And really, it handles *any* HTML, no matter how crappy - I'm using it to
> deal with pages that have random <span> and </span> in them with no
> matching end / start tags. Eugh.
> 
> Once you've written rewrite_url(), this will do the job on the BeautifulSoup
> side:
> 
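>   soup = BeautifulSoup()
>   soup.feed(source_html)
>   # (tag name, URL-bearing attribute) pairs to rewrite
>   for name, attr in (('img', 'src'), ('a', 'href')):
>     # calling the soup with a tag name returns every matching tag
>     for tag in soup(name):
>       if tag.get(attr):
>         tag[attr] = rewrite_url(tag[attr])
>   print soup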
> 
> 
>     Richard

Brilliant, Richard.
I did hack together a version that worked inside the Tag class of
BeautifulSoup - but your suggestion is much more elegant. I've already
written rewrite_url - twice now :-) Should work fine...
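
For anyone following along without a rewrite_url of their own, here is a
minimal sketch of what such a function might look like. It is not the version
mentioned above: the BASE_URL value and the choice to resolve relative links
against it are assumptions made purely for illustration, and it is written for
a current Python using only the standard library.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical address of the page the HTML was fetched from.
BASE_URL = "http://www.example.com/docs/index.html"


def rewrite_url(url, base=BASE_URL):
    """Return url made absolute against base.

    In-page anchors, empty values and non-HTTP schemes (mailto:,
    javascript:, ...) are returned unchanged.
    """
    url = url.strip()
    if not url or url.startswith("#"):
        return url
    scheme = urlsplit(url).scheme
    if scheme and scheme not in ("http", "https"):
        return url
    return urljoin(base, url)
```

Something like rewrite_url("pic.png") would then give a URL under
http://www.example.com/docs/, while a mailto: link passes through untouched.
Any other policy (routing links through a proxy script, say) would go in the
same place, since the loop over the soup only needs a function that maps one
URL string to another.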

Thanks

Fuzzy

http://www.voidspace.org.uk/atlantibots/pythonutils.html


