Screen scraper to get all 'a title' elements
ryguy7272
ryanshuell at gmail.com
Wed Nov 25 15:42:00 EST 2015
Hello experts. I'm looking at this url:
https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names
I'm trying to figure out how to list all 'a title' elements. For instance, I see the following:
<a title="Accident, Maryland" href="/wiki/Accident,_Maryland">Accident</a>
<a class="new" title="Ala-Lemu (page does not exist)" href="/w/index.php?title=Ala-Lemu&action=edit&redlink=1">Ala-Lemu</a>
<a title="Alert, Nunavut" href="/wiki/Alert,_Nunavut">Alert</a>
<a title="Apocalypse Peaks" href="/wiki/Apocalypse_Peaks">Apocalypse Peaks</a>
So, I tried putting a script together to get 'title'. Here's my attempt.
import requests
import sys
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text)
for link in soup.findAll('title'):
    print(link)
All that does is get the title of the page. I also tried to get the links from that URL with this script.
import urllib2
import re
#connect to a URL
website = urllib2.urlopen('https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names')
#read html code
html = website.read()
#use re.findall to get all the links
links = re.findall('"((http|ftp)s?://.*?)"', html)
print links
That doesn't work either. Basically, I'd like to see this.
Accident
Ala-Lemu
Alert
Apocalypse Peaks
Athol
Å
Barbecue
Båstad
Bastardstown
Batman
Bathmen (Battem), Netherlands
...
Worms
Yell
Zigzag
Zzyzx
How can I do that?
Thanks all!!
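For reference, here is a minimal sketch of one possible approach, not a definitive answer. The first script asks for tags named 'title', which is why it returns only the page's <title> element. What the desired output needs is the text of each <a> tag that carries a title attribute. The sketch uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra packages. The names TitledLinkParser and titled_link_texts are made up for illustration, and the sample string is built from the snippets quoted above plus one hypothetical link without a title.

```python
from html.parser import HTMLParser


class TitledLinkParser(HTMLParser):
    """Collect the visible text of every <a> tag that has a title attribute."""

    def __init__(self):
        super().__init__()
        self.names = []       # link texts collected so far
        self._buffer = None   # text pieces of the <a title=...> being read, else None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a" and "title" in dict(attrs):
            self._buffer = []

    def handle_data(self, data):
        # Only keep text while inside a titled <a> tag
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            self.names.append("".join(self._buffer).strip())
            self._buffer = None


def titled_link_texts(html):
    """Return the link text of each <a title=...> element in an HTML string."""
    parser = TitledLinkParser()
    parser.feed(html)
    parser.close()
    return parser.names


# Sample markup: three links quoted in the post, plus a hypothetical untitled link
sample = (
    '<a title="Accident, Maryland" href="/wiki/Accident,_Maryland">Accident</a>\n'
    '<a class="new" title="Ala-Lemu (page does not exist)" '
    'href="/w/index.php?title=Ala-Lemu&action=edit&redlink=1">Ala-Lemu</a>\n'
    '<a title="Alert, Nunavut" href="/wiki/Alert,_Nunavut">Alert</a>\n'
    '<a href="/wiki/Main_Page">no title attribute</a>\n'
)

for name in titled_link_texts(sample):
    print(name)
```

With BeautifulSoup, the equivalent selection would be soup.find_all('a', title=True), printing link.get_text() for each match. Either way, run against the full page this would likely also pick up navigation and sidebar links that have title attributes, so some filtering (for example, restricting the search to the article body) would still be needed to get exactly the list of place names.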