urllib2.urlopen(url) pulling something other than HTML

Stefan Behnel stefan.behnel-n05pAM at web.de
Tue Aug 21 02:44:27 EDT 2007


dogatemycomputer at gmail.com wrote:
> I personally think the application itself "feels" more complicated
> than it needs to be, but it's possible that's just my inexperience. I'm
> going to do some reading about the HTMLParser module. I'm sure I
> could make this spider a bit more functional in the process.

That's because you are parsing HTML with nothing but the standard library.
HTMLParser can do what you want, but it's rather low-level and hard to use,
especially for new users.
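For comparison, here is a minimal sketch of link extraction done with
HTMLParser alone (some_html_string is a placeholder for whatever you already
fetched with urllib2):

    from HTMLParser import HTMLParser

    class LinkExtractor(HTMLParser):
        # collect the href of every <a> tag the parser sees
        def __init__(self):
            HTMLParser.__init__(self)
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:  # attrs is a list of (name, value) pairs
                    if name == "href":
                        self.links.append(value)

    parser = LinkExtractor()
    parser.feed(some_html_string)
    print parser.links

Note that this gives you raw attribute values only; resolving relative links,
fetching them, and so on is all left to you.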

If you want to give lxml.html a try, a web spider would be something like this:

    import lxml.html as H

    def crawl(url, page_dict, depth=2, link_type="a"):
        if depth <= 0:
            return  # stop once the requested crawl depth is reached
        html = H.parse(url).getroot()
        html.make_links_absolute()

        page_dict[url] = (link_type, html)

        # iterlinks() yields (element, attribute, link, position) tuples
        for element, attribute, href, pos in html.iterlinks():
            if href not in page_dict:
                if element.tag in ("a", "img"):  # ignore other link types
                    crawl(href, page_dict, depth - 1, element.tag)

    page_dict = {}
    crawl("http://www.google.com", page_dict, 2)

    # and now do something with the pages in page_dict.
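For instance, a quick sketch of walking the collected pages and printing each
document's <title>, if it has one:

    for url, (link_type, html) in page_dict.items():
        # findtext() returns None when the page has no <title> element
        print url, "->", html.findtext(".//title")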

lxml can actually do a lot more for you; just look through the docs to get an
idea. You can find lxml here:

http://codespeak.net/lxml

lxml.html is not yet released, though. Its first release (as part of lxml 2.0)
is expected around the end of August. You can find some docs here:

http://codespeak.net/lxml/dev

and you can (easily) install it from Subversion sources:

http://codespeak.net/svn/lxml/trunk
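A rough sketch of a checkout and build (this assumes you have the libxml2 and
libxslt development headers installed, which building from source requires):

    svn checkout http://codespeak.net/svn/lxml/trunk lxml
    cd lxml
    python setup.py install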

Have fun,
Stefan


