Screen scraper to get all 'a title' elements

ryguy7272 ryanshuell at gmail.com
Wed Nov 25 17:48:40 EST 2015


On Wednesday, November 25, 2015 at 5:30:14 PM UTC-5, Grobu wrote:
> Hi
> 
> It seems that links on that Wikipedia page follow the structure:
> <a href="..." title="...">
> 
> You could extract a list of link titles with something like:
> re.findall( r'\<a[^>]+title="(.+?)"', html )
> 
> HTH,
> 
> -Grobu-
> 
> 
> On 25/11/15 21:55, MRAB wrote:
> > On 2015-11-25 20:42, ryguy7272 wrote:
> >> Hello experts.  I'm looking at this url:
> >> https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names
> >>
> >> I'm trying to figure out how to list all 'a title' elements.  For
> >> instance, I see the following:
> >> <a title="Accident, Maryland"
> >> href="/wiki/Accident,_Maryland">Accident</a>
> >> <a class="new" title="Ala-Lemu (page does not exist)"
> >> href="/w/index.php?title=Ala-Lemu&action=edit&redlink=1">Ala-Lemu</a>
> >> <a title="Alert, Nunavut" href="/wiki/Alert,_Nunavut">Alert</a>
> >> <a title="Apocalypse Peaks" href="/wiki/Apocalypse_Peaks">Apocalypse
> >> Peaks</a>
> >>
> >> So, I tried putting a script together to get 'title'.  Here's my attempt.
> >>
> >> import requests
> >> import sys
> >> from bs4 import BeautifulSoup
> >>
> >> url = "https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names"
> >> source_code = requests.get(url)
> >> plain_text = source_code.text
> >> soup = BeautifulSoup(plain_text)
> >> for link in soup.findAll('title'):
> >>      print(link)
> >>
> >> All that does is get the title of the page.  I tried to get the links
> >> from that url, with this script.
> >>
> > A 'title' element has the form "<title ...>". What you should be looking
> > for are 'a' elements, those of the form "<a ...>".
> >
> >> import urllib2
> >> import re
> >>
> >> #connect to a URL
> >> website = urllib2.urlopen('https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names')
> >>
> >>
> >> #read html code
> >> html = website.read()
> >>
> >> #use re.findall to get all the links
> >> links = re.findall('"((http|ftp)s?://.*?)"', html)
> >>
> >> print links
> >>
> >> That doesn't work either.  Basically, I'd like to see this.
> >>
> >> Accident
> >> Ala-Lemu
> >> Alert
> >> Apocalypse Peaks
> >> Athol
> >> Å
> >> Barbecue
> >> Båstad
> >> Bastardstown
> >> Batman
> >> Bathmen (Battem), Netherlands
> >> ...
> >> Worms
> >> Yell
> >> Zigzag
> >> Zzyzx
> >>
> >> How can I do that?
> >> Thanks all!!



Thanks!!  Is that a regex?  Can you explain exactly what it is doing?
Also, it seems to pick up a lot more than just the list I wanted, but that's OK; I can see why it does that.
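
For reference, the pattern suggested above can be read piece by piece. The following is a minimal sketch, not Grobu's exact code: it assumes a short HTML sample (the `html` string, built from the links quoted earlier plus one made-up link without a title) instead of the live page, and it writes `<a` where the original has `\<a`, which matches the same text.

```python
import re

# Assumed sample markup: three links quoted earlier in this thread,
# plus one hypothetical link that has no title attribute.
html = '''
<a title="Accident, Maryland" href="/wiki/Accident,_Maryland">Accident</a>
<a class="new" title="Ala-Lemu (page does not exist)" href="/w/index.php?title=Ala-Lemu&action=edit&redlink=1">Ala-Lemu</a>
<a title="Alert, Nunavut" href="/wiki/Alert,_Nunavut">Alert</a>
<a href="/wiki/Main_Page">no title attribute here</a>
'''

pattern = re.compile(
    r'<a'       # literal "<a": the start of an anchor tag
    r'[^>]+'    # one or more characters that are not ">", so we stay inside the tag
    r'title="'  # literal text: title="
    r'(.+?)'    # capture group: as few characters as possible...
    r'"'        # ...up to the next double quote
)

# With one capture group, findall returns only the captured text of each match.
titles = pattern.findall(html)
print(titles)
# ['Accident, Maryland', 'Ala-Lemu (page does not exist)', 'Alert, Nunavut']
```

On the real page the same pattern also matches every other link that carries a title attribute (navigation, sidebar, footer), which would explain why it returns more than the list of place names.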


