Python Web Scraping: Within href read only those values that have href in it

Peter Otten __peter__ at web.de
Sat Jan 14 03:44:33 EST 2017


shahsn11 at gmail.com wrote:

> I am trying to scrape a webpage just for learning. In that webpage there
> are multiple "a" tags. consider the below code
> 
> <a href='\abc\def\jkl'> Something </a>
> 
> <a href ='http:\\www.google.com'> Something</a>

These are probably all forward slashes.

> Now i want to read only those href in which there is http. My Current code
> is
> 
> for link in soup.find_all("a"):
>     print link.get("href")
> 
> i would like to change it to read only http links.

You mean href values that start with "http://"?
While you can do that with a callback

def check_scheme(href):
    return href is not None and href.startswith("http://")

for a in soup.find_all("a", href=check_scheme):
    print(a["href"])

or a regular expression

import re

for a in soup.find_all("a", href=re.compile("^http://")):
    print(a["href"])

why not keep things simple and check before printing? Like

for a in soup.find_all("a"):
    href = a.get("href", "") # empty string if href is missing
    if href.startswith("http://"):
        print(href)
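
If https links should count as well (my assumption, the question only mentions http), one possible sketch is to look at the parsed scheme instead of the string prefix. The name is_http_link and the sample hrefs list are made up for illustration; urlsplit comes from the standard library:

```python
from urllib.parse import urlsplit


def is_http_link(href):
    """Return True if href has an http or https scheme."""
    if href is None:  # no href attribute at all
        return False
    try:
        # urlsplit lowercases the scheme, so "HTTP://..." is accepted too
        scheme = urlsplit(href).scheme
    except ValueError:  # malformed URL, e.g. a broken IPv6 literal
        return False
    return scheme in ("http", "https")


# Hypothetical sample values standing in for the href attributes of a page
hrefs = ["/abc/def/jkl", "http://www.google.com", "https://example.com/", None]
http_links = [h for h in hrefs if is_http_link(h)]
print(http_links)
```

Because the function accepts None it should also work as a drop-in replacement for check_scheme in the callback variant, i.e. soup.find_all("a", href=is_http_link).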




