writing a web client

Fri Jul 29 17:19:11 EDT 2005

"Fuzzyman" <fuzzyman at gmail.com> writes:

> Ajar wrote:
>> I want to write a program which will automatically login to my ISPs
>> website, retrieve data and do some processing. Can this be done? Can
>> you point me to any example python programs which do similar things?
>>
>> Regards,
>> Ajar
>
> Very easily. Have a look at my article on the ``urllib2`` module.
>
> http://www.voidspace.org.uk/python/articles.shtml#http
>
> You may need to use ClientCookie/cookielib to handle cookies and may
> have to cope with BASIC authentication. There are also articles about
> both of these as well.
>
> If you want to handle filling in forms programmatically then the module
> ClientForm is useful (allegedly).
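
In Python 3, urllib2 and cookielib live on as urllib.request and
http.cookiejar. A minimal sketch of the login step using those modules
follows. The URL and the form field names (LOGIN_URL, "username",
"password") are placeholders made up for illustration, not anything
taken from a real ISP; check the login form's HTML for the real ones.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical endpoint; substitute the action URL of your ISP's login form.
LOGIN_URL = "https://isp.example.com/login"


def build_opener_with_cookies():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(username, password):
    """Build (but do not send) a POST request carrying the login form fields."""
    # The field names here are assumptions; real forms often use other names.
    form = urllib.parse.urlencode({"username": username, "password": password})
    return urllib.request.Request(LOGIN_URL, data=form.encode("ascii"))


opener, jar = build_opener_with_cookies()
request = build_login_request("ajar", "secret")
# opener.open(request) would submit the form. Any session cookie the site
# sets lands in `jar` and is sent again on later opener.open(...) calls.
```

Nothing is sent over the network until opener.open() is called, so the
request can be inspected first.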

The last piece of the puzzle is BeautifulSoup. That's what you use to
extract data from the web page.

For instance a lot of web pages listing data have something like this
on it:

<table>
...
<tr><th>Item:</th><td>Value</td></tr>
...
</table>

You can extract the value from such a table with BeautifulSoup by doing
something like:

soup.fetchText('Item:')[0].findParent(['td', 'th']).nextSibling.string

This works whether the item is in a td or a th tag.

Of course, I recommend doing things a little bit more verbosely. In my
case, I'm writing code that's expected to work on a large number of
web pages with different formats, so I put in a lot of error checking,
along with informative errors.

        links = table.fetchText(name)
        if not links:
            raise BadTableMatch("%s not found in table" % name)
        # The label may sit in either a <td> or a <th> cell.
        td = links[0].findParent(['td', 'th'])
        if not td:
            raise BadTableMatch("td/th not a parent of %s" % name)
        # The value is expected in the cell right after the label.
        sibling = td.nextSibling
        if not sibling:
            raise BadTableMatch("td for %s has no sibling" % name)
        out = get_contents(sibling)
        if not out:
            raise BadTableMatch("no value string found for %s" % name)
        return out

Without these checks the code would still fail with an exception
whenever one of those conditions held, but the error messages wouldn't
be as informative.
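
The same pattern of checking each step and raising a descriptive error
can be tried out without BeautifulSoup. This is only a sketch under
stated assumptions: it swaps in the standard library's
xml.etree.ElementTree, which needs well-formed markup (real-world HTML
often is not), and the function name get_value and the sample table
are made up for illustration. BadTableMatch is defined here as a plain
Exception subclass.

```python
import xml.etree.ElementTree as ET


class BadTableMatch(Exception):
    """Raised when a label/value pair cannot be located in a table."""


def get_value(table, name):
    """Return the flat text of the cell that follows the cell labelled `name`."""
    for row in table.iter("tr"):
        cells = [c for c in row if c.tag in ("td", "th")]
        for i, cell in enumerate(cells):
            if "".join(cell.itertext()).strip() != name:
                continue
            if i + 1 >= len(cells):
                raise BadTableMatch("cell for %s has no sibling" % name)
            # itertext() also picks up text inside nested tags such as <b>.
            out = "".join(cells[i + 1].itertext()).strip()
            if not out:
                raise BadTableMatch("no value string found for %s" % name)
            return out
    raise BadTableMatch("%s not found in table" % name)


# Made-up sample data covering the success case and two failure cases.
table = ET.fromstring(
    "<table>"
    "<tr><th>Item:</th><td>Value <b>one</b></td></tr>"
    "<tr><th>Empty:</th><td></td></tr>"
    "<tr><th>Lonely:</th></tr>"
    "</table>"
)
print(get_value(table, "Item:"))  # prints: Value one
```

Asking for "Empty:", "Lonely:" or a label that is not in the table
raises BadTableMatch with a message that says which step failed.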

Oh yeah - get_contents isn't from BeautifulSoup. I ran into cases
where the <td> tag held other tags, and wanted the flat text
extracted. Couldn't find a BeautifulSoup method to do that, so I wrote:

    def get_contents(ele):
        """Utility function to return all the text in a tag."""

        if ele.string:
            return ele.string  # We only have one string. Done.
        return ''.join(get_contents(x) for x in ele)
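
Later BeautifulSoup releases (the bs4 package) provide a get_text()
method on tags for this job. The recursion itself can also be sketched
with the standard library alone. The example below assumes well-formed
markup parsed by xml.etree.ElementTree; flat_text and the sample cell
are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET


def flat_text(elem):
    """Recursively join an element's own text, its children's text,
    and the text that trails each child."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(flat_text(child))
        parts.append(child.tail or "")
    return "".join(parts)


# Made-up cell whose text is split across nested tags.
cell = ET.fromstring("<td>Price: <b>12</b> USD <i>(approx)</i></td>")
print(flat_text(cell))  # prints: Price: 12 USD (approx)
```

ElementTree's built-in "".join(cell.itertext()) gives the same string;
the hand-written version just makes the recursion visible.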

        <mike
-- 
Mike Meyer <mwm at mired.org>			http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.