Listing link urls

Kishore Kumar Alajangi akishorecert at gmail.com
Sun Oct 29 09:16:15 EDT 2017


+ tutor

On Sun, Oct 29, 2017 at 6:57 AM, Kishore Kumar Alajangi <
akishorecert at gmail.com> wrote:

> Hi,
>
> I am facing an issue with listing specific URLs inside a web page:
>
> https://economictimes.indiatimes.com/archive.cms
>
> The page contains link URLs by year and month, e.g.:
> /archive/year-2001,month-1.cms
>
> I am able to list all the required URLs using the code below:
>
> from bs4 import BeautifulSoup
> import re
> import urllib.request
>
> url = "http://economictimes.indiatimes.com"
> req = urllib.request.Request(url + '/archive.cms',
>                              headers={'User-Agent': 'Mozilla/5.0'})
>
> links = []
> data = urllib.request.urlopen(req).read()
> page = BeautifulSoup(data, 'html.parser')
>
> # collect the hrefs that start with "/archive/"
> for link in page.findAll('a', href=re.compile('^/archive/')):
>     l = link.get('href')
>     links.append(url + l)
>
> with open("output.txt", "a") as f:
>     for post in links:
>         f.write(post + '\n')
>
> Sample result in the text file:
>
> http://economictimes.indiatimes.com/archive/year-2001,month-1.cms
> http://economictimes.indiatimes.com/archive/year-2001,month-2.cms
> http://economictimes.indiatimes.com/archive/year-2001,month-3.cms
> http://economictimes.indiatimes.com/archive/year-2001,month-4.cms
> http://economictimes.indiatimes.com/archive/year-2001,month-5.cms
> http://economictimes.indiatimes.com/archive/year-2001,month-6.cms
>
>
> I store the list of URLs in a text file. From the month URLs I want to
> retrieve the day URLs, which start with "/archivelist", using the code
> below, but I am not getting any result. If I check with inspect element,
> the URLs starting with /archivelist are available, e.g.:
>
> <a href="/archivelist/year-2001,month-3,starttime=36951.cms"></a>
>
> Kindly help me find where I am going wrong.
>
> from bs4 import BeautifulSoup
> import re
> import urllib.request
>
> file = open("output.txt", "r")
>
> for i in file:
>     urls = urllib.request.Request(i, headers={'User-Agent': 'Mozilla/5.0'})
>     data1 = urllib.request.urlopen(urls).read()
>     page1 = BeautifulSoup(data1, 'html.parser')
>     for link1 in page1.findAll(href = re.compile('^/archivelist/')):
>         l1 = link1.get('href')
>         print(l1)
>
>
> Thanks,
>
> Kishore.
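
A likely culprit in the second script: each line read back from output.txt still ends with "\n", so the URL handed to urllib.request.Request is invalid and the day pages never load. Below is a minimal sketch with the newline stripped; the helper names (clean, day_links, scrape_days) are hypothetical, and it assumes the /archivelist links are present in the served HTML, as the inspect-element check suggests:

```python
import re
import urllib.request

from bs4 import BeautifulSoup


def clean(line):
    """Strip the trailing newline (and stray whitespace) from a stored URL."""
    return line.strip()


def day_links(html):
    """Return every href in the page that starts with /archivelist/."""
    page = BeautifulSoup(html, 'html.parser')
    return [a.get('href')
            for a in page.find_all('a', href=re.compile('^/archivelist/'))]


def scrape_days(path="output.txt"):
    """Fetch each stored month URL and print its day links (does network I/O)."""
    with open(path) as f:
        for line in f:
            url = clean(line)  # without this, the trailing "\n" breaks the URL
            req = urllib.request.Request(url,
                                         headers={'User-Agent': 'Mozilla/5.0'})
            html = urllib.request.urlopen(req).read()
            for href in day_links(html):
                print(href)
```

Running scrape_days() should then print the /archivelist/... day links; the findAll(href=re.compile('^/archivelist/')) filter in the original code is itself valid BeautifulSoup, so the unstripped newline is the more likely problem.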
>



More information about the Python-list mailing list