[Tutor] BeautifulSoup confusion

Stefan Behnel stefan_ml at behnel.de
Fri Apr 10 08:09:29 CEST 2009


Steve Lyskawa wrote:
> I am not a programmer by trade but I've been using Python for 10+ years,
> usually for text file conversion and protocol analysis.  I'm having a
> problem with Beautiful Soup.  I can get it to scrape off all the href links
> on a web page but I am having problems selecting specific URI's from the
> output supplied by Beautiful Soup.
> What exactly is it returning to me and what command would I use to find that
> out?  Do I have to take each line it gives me and put it into a list before I
> can, for example, get only certain URI's containing a certain string or use
> the results to get the web page that the URI is referring to?
> 
> The pseudo code for what I am trying to do:
> 
> Get all URI's from web page that contain string "env.html"
> Open the web page it is referring to.
> Scrape selected information off of that page.

That's very easy to do with lxml.html, which offers an iterlinks() method
on elements to iterate over all links in a document (not only a-href, but
also in stylesheets, for example). It can parse directly from a URL, so you
don't need to go through urllib and friends, and it can make links in a
document absolute before iterating over them, so that relative links will
work for what you are doing.

http://codespeak.net/lxml/lxmlhtml.html#working-with-links
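
A minimal sketch of that approach, assuming lxml is installed. The helper
name find_links, the sample markup and the example.com base URL are made up
for illustration. It parses from a string so that it runs offline; for a
live page you would parse from the URL instead, e.g. with
lxml.html.parse(url).getroot().

```python
import lxml.html


def find_links(doc, needle="env.html"):
    """Return every link in the parsed document whose URL contains needle."""
    # iterlinks() yields (element, attribute, link, pos) tuples.
    return [link for (element, attribute, link, pos) in doc.iterlinks()
            if needle in link]


# Hypothetical sample page, standing in for the real one.
page = lxml.html.fromstring(
    '<html><body>'
    '<a href="a/env.html">wanted</a>'
    '<a href="other.html">not wanted</a>'
    '</body></html>'
)

# Resolve relative links against the (assumed) address of the page.
page.make_links_absolute("http://example.com/reports/")

matches = find_links(page)
for url in matches:
    print(url)
```

Each URL that comes out of this can be passed back to the parser to fetch
and scrape the page it refers to.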

Also, you should use the urlparse module to split the URL (in case it
contains parameters etc.) and check only the path section.
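
A small sketch of that check, written for Python 3, where the urlparse
module has become urllib.parse. The function name path_matches and the
example URLs are made up for illustration.

```python
from urllib.parse import urlsplit


def path_matches(url, needle="env.html"):
    """Return True if needle occurs in the path part of url.

    The query string and fragment are ignored, so a URL that only
    mentions needle in its parameters does not match.
    """
    return needle in urlsplit(url).path


# Path is "/a/env.html", so this one matches.
in_path = path_matches("http://example.com/a/env.html?id=3")

# Path is "/search"; "env.html" only appears in the query string.
in_query_only = path_matches("http://example.com/search?q=env.html")
```

A plain substring test on the whole URL would accept both of these, which
is why only the path is compared here.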

Stefan


