[Tutor] BeautifulSoup confusion

Fri Apr 10 08:09:29 CEST 2009

Steve Lyskawa wrote:
> I am not a programmer by trade but I've been using Python for 10+ years,
> usually for text file conversion and protocol analysis.  I'm having a
> problem with Beautiful Soup.  I can get it to scrape off all the href links
> on a web page but I am having problems selecting specific URI's from the
> output supplied by Beautiful Soup.
> What exactly is it returning to me and what command would I use to find that
> out?  Do I have to take each line it gives me and put it into a list before I
> can, for example, get only certain URI's containing a certain string or use
> the results to get the web page that the URI is referring to?
> 
> The pseudo code for what I am trying to do:
> 
> Get all URI's from web page that contain string "env.html"
> Open the web page it is referring to.
> Scrape selected information off of that page.

That's very easy to do with lxml.html, which offers an iterlinks() method
on elements to iterate over all links in a document (not only a-href, but
also links in stylesheets, for example). It can parse directly from a URL,
so you don't need to go through urllib and friends, and it can make the
links in a document absolute before you iterate over them, so that relative
links will work for what you are doing.

http://codespeak.net/lxml/lxmlhtml.html#working-with-links
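
As a rough sketch of the approach above (not a finished scraper), the
following parses a small inline HTML string instead of a live page so that
it runs without network access. The function name find_links, the sample
markup and the example.com URLs are made up for illustration. For a real
page, lxml.html.parse(url).getroot() should give an equivalent root element.

```python
import lxml.html

# Hypothetical sample page; a real page would be parsed from its URL instead.
SAMPLE = """<html><body>
<a href="reports/env.html">Environment</a>
<a href="/about.html">About</a>
<a href="http://other.example/data/env.html?id=3">Remote</a>
</body></html>"""


def find_links(html_text, base_url, needle="env.html"):
    """Return the absolute links in html_text that contain needle."""
    doc = lxml.html.fromstring(html_text)
    # Resolve relative links against the address the page was fetched from.
    doc.make_links_absolute(base_url)
    # iterlinks() yields (element, attribute, link, pos) tuples.
    return [link for element, attribute, link, pos in doc.iterlinks()
            if needle in link]


links = find_links(SAMPLE, "http://example.com/site/index.html")
for link in links:
    print(link)
```

Each link that comes back is an absolute URL, so it can be handed straight
to lxml.html.parse() to fetch and scrape the page it refers to.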

Also, you should use the urlparse module to split the URL (in case it
contains parameters etc.) and check only the path section.
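
A minimal sketch of that check, assuming Python 3, where the urlparse
module has become urllib.parse (on Python 2 the import would be
"from urlparse import urlparse"). The helper name path_contains and the
example URLs are made up for illustration.

```python
from urllib.parse import urlparse


def path_contains(url, needle="env.html"):
    """Return True if needle occurs in the path part of url only."""
    # urlparse() splits off the query string, parameters and fragment,
    # so a match there does not count.
    return needle in urlparse(url).path


print(path_contains("http://example.com/a/env.html?x=1"))
print(path_contains("http://example.com/page.html?next=env.html"))
```

The second URL only mentions env.html in its query string, so a plain
substring test on the whole URL would accept it, while the path check
does not.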

Stefan