Strip away HTML tags from extracted links

Max Cuban edzeame at gmail.com
Fri Nov 29 11:56:08 EST 2013


I have the following code to extract certain links from a webpage:

from bs4 import BeautifulSoup
import urllib2, sys
import re

def tonaton():
    site = "http://tonaton.com/en/job-vacancies-in-ghana"
    hdr = {'User-Agent' : 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    jobpass = urllib2.urlopen(req)
    invalid_tag = ('h2')
    soup = BeautifulSoup(jobpass)
    print soup.find_all('h2')

The links are contained in the 'h2' tags so I get the links as follows:

<h2><a href="/en/cashiers-accra">cashiers </a></h2>
<h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
<h2><a href="/en/automobile-technician-accra">Automobile Technician</a></h2>
<h2><a href="/en/marketing-officer-accra-4">Marketing Officer</a></h2>

But I'm interested in getting rid of all the 'h2' tags so that I have links only in this manner:

<a href="/en/cashiers-accra">cashiers </a>
<a href="/en/cake-baker-accra">Cake baker</a>
<a href="/en/automobile-technician-accra">Automobile Technician</a>
<a href="/en/marketing-officer-accra-4">Marketing Officer</a>
 

I therefore updated my code to look like this:

def tonaton():
    site = "http://tonaton.com/en/job-vacancies-in-ghana"
    hdr = {'User-Agent' : 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    jobpass = urllib2.urlopen(req)
    invalid_tag = ('h2')
    soup = BeautifulSoup(jobpass)
    jobs = soup.find_all('h2')
    for tag in invalid_tag:
        for match in jobs(tag):
            match.replaceWithChildren()
    print jobs

But I couldn't get it to work, even though I thought that was the best logic I could come up with. I'm a newbie, though, so I know there is probably a better way to do this.
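
For reference, here is a minimal sketch of two ways the desired output might be produced. It runs on a sample of the markup shown above rather than on the live site. Two things in the attempt above look suspect: ('h2') is just the string 'h2' and not a tuple, so the loop walks over the characters 'h' and '2'; and jobs is the list returned by find_all, not a tag, so it cannot be called the way soup(tag) can. The names SAMPLE_HTML, links_in_h2 and unwrap_h2 are made up for illustration, and the "html.parser" argument is an assumption (any installed parser should do). unwrap() is the bs4 name for replaceWithChildren().

```python
from bs4 import BeautifulSoup

# Sample markup copied from the output shown earlier in this post
# (hypothetical stand-in for the downloaded page).
SAMPLE_HTML = """
<h2><a href="/en/cashiers-accra">cashiers </a></h2>
<h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
"""


def links_in_h2(html):
    """Return the first <a> tag inside each <h2>, as strings."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for h2 in soup.find_all("h2"):
        a = h2.find("a")  # None if this <h2> holds no link
        if a is not None:
            links.append(str(a))
    return links


def unwrap_h2(html):
    """Return the markup with every <h2> replaced by its children."""
    soup = BeautifulSoup(html, "html.parser")
    for h2 in soup.find_all("h2"):
        h2.unwrap()  # drops the <h2> wrapper, keeps what was inside it
    return str(soup)
```

Calling links_in_h2(SAMPLE_HTML) should give a list with one string per link, while unwrap_h2(SAMPLE_HTML) gives back the whole fragment with the 'h2' wrappers gone. The first form is probably closer to what I described, since it ignores everything on the page that is not a link inside an 'h2'.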

Any help will be gratefully appreciated.

Thanks
