Example Script to parse web page links and extract data?

Steven Bethard steven.bethard at gmail.com
Wed Sep 14 18:53:21 EDT 2005


livin wrote:
> I'm looking for an easy way to automate the below web site browsing and pull 
> the data I'm searching for.

This is a task that BeautifulSoup[1] is usually good for.

> 4) After search, table shows many links (hundreds sometimes) to the actual 
> data I need.
>     Links are this format... <a href="javascript:GetAgent('AA059')">
> 
> 5) Each link opens new window with table providing required data.
>     The URLs that each href opens is this... 
> http://armls.marketlinx.com/Roster/Scripts/Member.asp?PubID=AA059 where the 
> PubID is record I need.

I'm not entirely sure I got your problem description right, but I think 
points 4 and 5 would look something like:

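import re
import urllib
import BeautifulSoup

# Fetch the page and parse it (urllib.urlopen is the Python 2 API).
base_url = 'http://armls.marketlinx.com/.../Member.asp?PubID=AA059'
html = urllib.urlopen(base_url).read()
soup = BeautifulSoup.BeautifulSoup(html)

# Match hrefs like javascript:GetAgent('AA059'); group 1 is the PubID.
link_matcher = re.compile(r"javascript:GetAgent\('([^']*)'\)")
for link_elem in soup('a', {'href': link_matcher}):
     ...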
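
For what it's worth, here is a rough sketch of what the loop body might do: pull the PubID out of each GetAgent href and build the Member.asp URL from your point 5. It assumes Python 3 and uses only the standard library (html.parser in place of BeautifulSoup), so it is not the same code as above. The names AgentLinkParser and member_urls, and the sample HTML, are made up for illustration; only the href format and the Member.asp URL come from your message.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlencode

# URL and href format are taken from the quoted message; the rest is illustrative.
MEMBER_URL = 'http://armls.marketlinx.com/Roster/Scripts/Member.asp'
AGENT_LINK = re.compile(r"javascript:GetAgent\('([^']*)'\)")


class AgentLinkParser(HTMLParser):
    """Collect the PubID from every <a href="javascript:GetAgent('...')">."""

    def __init__(self):
        super().__init__()
        self.pub_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # href may be missing or valueless, so fall back to an empty string
        href = dict(attrs).get('href') or ''
        match = AGENT_LINK.match(href)
        if match:
            self.pub_ids.append(match.group(1))


def member_urls(html):
    """Return one Member.asp URL per GetAgent link found in html."""
    parser = AgentLinkParser()
    parser.feed(html)
    return [MEMBER_URL + '?' + urlencode({'PubID': pub_id})
            for pub_id in parser.pub_ids]


# Hypothetical stand-in for the search results page.
sample = '''
<table>
<tr><td><a href="javascript:GetAgent('AA059')">Agent one</a></td></tr>
<tr><td><a href="javascript:GetAgent('BB123')">Agent two</a></td></tr>
<tr><td><a href="/help.asp">Help</a></td></tr>
</table>
'''

for url in member_urls(sample):
    print(url)
```

Each URL that comes back could then be fetched and parsed in the same way to get at the table with the data you need.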

HTH,

STeVe

[1] http://www.crummy.com/software/BeautifulSoup/


