Screen scraper to get all 'a title' elements

Wed Nov 25 15:55:56 EST 2015

On 2015-11-25 20:42, ryguy7272 wrote:
> Hello experts.  I'm looking at this url:
> https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names
>
> I'm trying to figure out how to list all 'a title' elements.  For instance, I see the following:
> <a title="Accident, Maryland" href="/wiki/Accident,_Maryland">Accident</a>
> <a class="new" title="Ala-Lemu (page does not exist)" href="/w/index.php?title=Ala-Lemu&action=edit&redlink=1">Ala-Lemu</a>
> <a title="Alert, Nunavut" href="/wiki/Alert,_Nunavut">Alert</a>
> <a title="Apocalypse Peaks" href="/wiki/Apocalypse_Peaks">Apocalypse Peaks</a>
>
> So, I tried putting a script together to get 'title'.  Here's my attempt.
>
> import requests
> import sys
> from bs4 import BeautifulSoup
>
> url = "https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names"
> source_code = requests.get(url)
> plain_text = source_code.text
> soup = BeautifulSoup(plain_text)
> for link in soup.findAll('title'):
>      print(link)
>
> All that does is get the title of the page.  I tried to get the links from that url, with this script.
>
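A 'title' element has the form "<title ...>", which is why
findAll('title') returns only the page title. In your examples, 'title'
is an attribute of an 'a' element, so what you should be looking for are
'a' elements, those of the form "<a ...>".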
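
Something along these lines might do it. This is only a sketch: the
function name titled_link_texts and the sample markup are made up for
illustration (the markup is modelled on the links you quoted), and I've
passed "html.parser" explicitly so bs4 doesn't have to guess a parser.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the links quoted above; the last link has no
# title attribute and should be skipped.
SAMPLE_HTML = """
<ul>
<li><a title="Accident, Maryland" href="/wiki/Accident,_Maryland">Accident</a></li>
<li><a class="new" title="Ala-Lemu (page does not exist)" href="/w/index.php?title=Ala-Lemu">Ala-Lemu</a></li>
<li><a title="Alert, Nunavut" href="/wiki/Alert,_Nunavut">Alert</a></li>
<li><a href="#top">Back to top</a></li>
</ul>
"""


def titled_link_texts(html):
    """Return the visible text of every <a> element that has a title attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # title=True matches any <a> tag that carries a title attribute,
    # whatever its value.
    return [link.get_text() for link in soup.find_all("a", title=True)]


for text in titled_link_texts(SAMPLE_HTML):
    print(text)
```

For the real page you would pass in requests.get(url).text instead of
the sample string. Expect extra entries from the page's navigation
links, which probably carry title attributes too, so you may need to
narrow the search to the part of the page that holds the list.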

> import urllib2
> import re
>
> #connect to a URL
> website = urllib2.urlopen('https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names')
>
> #read html code
> html = website.read()
>
> #use re.findall to get all the links
> links = re.findall('"((http|ftp)s?://.*?)"', html)
>
> print links
>
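
Two things work against that regex, as far as I can tell: re.findall
returns a tuple per match when the pattern contains more than one group,
and the pattern only matches absolute URLs, whereas the links you quoted
use relative hrefs such as "/wiki/Accident,_Maryland". If you'd rather
not install anything, here is a rough sketch that uses only the standard
library's html.parser to collect the title attributes. It assumes
Python 3 (your script uses urllib2, so it is Python 2), and the class
name TitleCollector and the sample markup are mine, purely for
illustration.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the title attribute of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            title = dict(attrs).get("title")
            if title is not None:
                self.titles.append(title)


# Sample markup modelled on the quoted links; the middle link has no title.
sample = (
    '<a title="Accident, Maryland" href="/wiki/Accident,_Maryland">Accident</a>'
    '<a href="#top">Back to top</a>'
    '<a title="Alert, Nunavut" href="/wiki/Alert,_Nunavut">Alert</a>'
)

collector = TitleCollector()
collector.feed(sample)
print(collector.titles)
```

Note that this gathers the title attribute values ("Accident, Maryland")
rather than the link text ("Accident"); which of the two you want
depends on what you mean by 'a title'.
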
> That doesn't work either.  Basically, I'd like to see this.
>
> Accident
> Ala-Lemu
> Alert
> Apocalypse Peaks
> Athol
> Å
> Barbecue
> Båstad
> Bastardstown
> Batman
> Bathmen (Battem), Netherlands
> ...
> Worms
> Yell
> Zigzag
> Zzyzx
>
> How can I do that?
> Thanks all!!
>
>