writing a web client
Mike Meyer
mwm at mired.org
Fri Jul 29 17:19:11 EDT 2005
"Fuzzyman" <fuzzyman at gmail.com> writes:
> Ajar wrote:
>> I want to write a program which will automatically login to my ISPs
>> website, retrieve data and do some processing. Can this be done? Can
>> you point me to any example python programs which do similar things?
>>
>> Regards,
>> Ajar
>
> Very easily. Have a look at my article on the ``urllib2`` module.
>
> http://www.voidspace.org.uk/python/articles.shtml#http
>
> You may need to use ClientCookie/cookielib to handle cookies and may
> have to cope with BASIC authentication. There are also articles about
> both of these as well.
>
> If you want to handle filling in forms programmatically then the module
> ClientForm is useful (allegedly).
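For the login step itself, here is a minimal sketch of the cookie-handling approach described above. It assumes Python 3, where urllib2 and cookielib live on as urllib.request and http.cookiejar. The login URL and the form field names are hypothetical placeholders; the real ones have to be read off the ISP's login form.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical values: substitute the real login URL and form field names.
LOGIN_URL = "https://isp.example.com/login"
FORM_FIELDS = {"username": "ajar", "password": "secret"}


def build_opener():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(url, fields):
    """Encode the form fields as the body of a POST request."""
    data = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(url, data=data)


opener, jar = build_opener()
request = build_login_request(LOGIN_URL, FORM_FIELDS)
# Uncomment against a real site. Later opener.open() calls on the same
# opener send back whatever session cookie the login response set.
# page = opener.open(request).read()
```

Nothing is sent until opener.open() is called, so the request can be inspected first.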
The last piece of the puzzle is BeautifulSoup. That's what you use to
extract data from the web page.
For instance a lot of web pages listing data have something like this
on it:
<table>
...
<tr><th>Item:</th><td>Value</td></tr>
...
</table>
You can extract the value from a table like that with BeautifulSoup by doing something like:
    soup.fetchText('Item:')[0].findParent(['td', 'th']).nextSibling.string
This works whether the item is in a td or a th tag.
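If BeautifulSoup isn't available, the same label-to-value lookup can be sketched with the standard library's html.parser instead. This is a swapped-in technique, not BeautifulSoup's API. The class name RowValues and the sample page are made up for illustration, and the sketch assumes each row holds a label cell followed by a value cell, with no nested tables.

```python
from html.parser import HTMLParser


class RowValues(HTMLParser):
    """Collect {label: value} from rows like <tr><th>L</th><td>V</td></tr>."""

    def __init__(self):
        super().__init__()
        self.values = {}
        self._cells = None      # text of each cell in the current row
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag in ("td", "th") and self._cells is not None:
            self._cells.append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._cells is not None:
            # First cell is the label, second cell is the value.
            if len(self._cells) >= 2:
                self.values[self._cells[0].strip()] = self._cells[1].strip()
            self._cells = None
            self._in_cell = False

    def handle_data(self, data):
        # Text inside nested tags (<b>, <a>, ...) arrives here as well,
        # so the cell text comes out flattened.
        if self._in_cell:
            self._cells[-1] += data


page = "<table><tr><th>Item:</th><td>Some <b>bold</b> value</td></tr></table>"
parser = RowValues()
parser.feed(page)
parser.close()
print(parser.values["Item:"])   # prints: Some bold value
```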
Of course, I recommend doing things a little bit more verbosely. In my
case, I'm writing code that's expected to work on a large number of
web pages with different formats, so I put in a lot of error checking,
along with informative errors.
    links = table.fetchText(name)
    if not links:
        raise BadTableMatch("%s not found in table" % name)
    td = links[0].findParent(['td', 'th'])
    if not td:
        raise BadTableMatch("td/th not a parent of %s" % name)
    sibling = td.nextSibling
    if not sibling:
        raise BadTableMatch("td for %s has no sibling" % name)
    out = get_contents(sibling)
    if not out:
        raise BadTableMatch("no value string found for %s" % name)
    return out
BeautifulSoup would raise exceptions if the conditions I check for are
true and I didn't check them - but the error messages wouldn't be as
informative.
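The snippet above shows neither how BadTableMatch is defined nor how a caller uses it. One way that might look is sketched below; this is an assumption, not the code from my project. To keep it self-contained, find_value works on a plain dict of label -> text standing in for a parsed table, and the names find_value, scrape_all and the sample pages are all hypothetical.

```python
class BadTableMatch(Exception):
    """Raised when a labelled value cannot be found on a page."""


def find_value(cells, name):
    # `cells` is a stand-in for a parsed table: a dict of label -> text.
    if name not in cells:
        raise BadTableMatch("%s not found in table" % name)
    if not cells[name]:
        raise BadTableMatch("no value string found for %s" % name)
    return cells[name]


def scrape_all(pages, name):
    """Look up `name` on every page; return (values, problems)."""
    values, problems = {}, []
    for url, cells in pages.items():
        try:
            values[url] = find_value(cells, name)
        except BadTableMatch as err:
            # Record which page failed and why, then keep going.
            problems.append("%s: %s" % (url, err))
    return values, problems


pages = {"a": {"Item:": "42"}, "b": {}, "c": {"Item:": ""}}
values, problems = scrape_all(pages, "Item:")
```

With a dedicated exception class, one badly formatted page is reported with a message naming the missing item, and the remaining pages are still processed.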
Oh yeah - get_contents isn't from BeautifulSoup. I ran into cases
where the <td> tag held other tags, and wanted the flat text
extracted. Couldn't find a BeautifulSoup method to do that, so I wrote:
    def get_contents(ele):
        """Utility function to return all the text in a tag."""
        if ele.string:
            return ele.string  # We only have one string. Done
        return ''.join(get_contents(x) for x in ele)
<mike
--
Mike Meyer <mwm at mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.