web scraping help / better way to do it ?

Tue Jan 19 05:29:33 EST 2016

Matt wrote:

> Beginner Python user (3.5) trying to scrape this page and get the ladder:
> www.afl.com.au/ladder. It's dynamic content, so I used lynx -dump to get a
> txt file and am parsing that.
> 
> Here is the code
> 
> # import lynx -dump txt file
> f = open('c:/temp/afl2.txt','r').read()
> 
> # Put import txt file into list
> afl_list = f.split(' ')
> 
> #here are the things we want to search for
> search_list = ['FRE', 'WCE', 'HAW', 'SYD', 'RICH', 'WB', 'ADEL', 'NMFC',
> 'PORT', 'GEEL', 'GWS', 'COLL', 'MELB', 'STK', 'ESS', 'GCFC', 'BL', 'CARL']
> 
> def build_ladder():
>     for l in search_list:
>         output_num = afl_list.index(l)
>         list_pos = output_num -1
>         ladder_pos = afl_list[list_pos]
>         print(ladder_pos + ' ' + '-' + ' ' + l)
> 
> build_ladder()
> 
> 
> Which outputs this.
> 
> 1 - FRE
> 2 - WCE
> 3 - HAW
> 4 - SYD
> 5 - RICH
> 6 - WB
> 7 - ADEL
> 8 - NMFC
> 9 - PORT
> 10 - GEEL
> * - GWS
> 12 - COLL
> 13 - MELB
> 14 - STK
> 15 - ESS
> 16 - GCFC
> 17 - BL
> 18 - CARL
> 
> Notice that number 11 is missing because my script picks up "GWS" which is
> located earlier in the page.  What is the best way to skip that (and get
> the "GWS" lower down in the txt file) or am I better off approaching the
> code in a different way?

If you look at the HTML source you'll see that the desired "GWS" is inside a
table, together with the other abbreviations. To extract (parts of) that
table you should use a tool that understands the structure of HTML.
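
If you want to keep the text-dump approach in the meantime, one possible
workaround is to accept a match only when the token right before it is a
number. The sketch below is only an illustration: sample_dump is a made-up
excerpt (I am assuming the early "GWS" is preceded by a "*" bullet, as your
output suggests), and ladder_positions() is a hypothetical helper name.

```python
# Hypothetical excerpt of a lynx dump: a menu list followed by ladder rows.
sample_dump = """
   * GWS
   * COLL

   1 FRE 22 17 5
   10 GEEL 22 11 9
   11 GWS 22 11 11
   12 COLL 22 10 12
"""


def ladder_positions(text, teams):
    """Map each team to the number that directly precedes it in the text."""
    tokens = text.split()  # split on any whitespace, including newlines
    positions = {}
    for prev, token in zip(tokens, tokens[1:]):
        # Accept a match only if the preceding token is a number; this
        # skips menu entries such as "* GWS". Keep the first such match.
        if token in teams and prev.isdigit() and token not in positions:
            positions[token] = int(prev)
    return positions


teams = ["FRE", "GEEL", "GWS", "COLL"]
for team, pos in ladder_positions(sample_dump, teams).items():
    print(pos, "-", team)
```

This still depends on the exact layout of the dump, so it is fragile compared
with parsing the HTML itself.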

The most popular library to parse HTML with Python is BeautifulSoup, but my
example uses lxml:

$ cat ladder.py
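import urllib.request
import io
import lxml.html

def first(row, xpath):
    # Return the first match of the XPath expression, stripped of whitespace.
    return row.xpath(xpath)[0].strip()

# Download the page and parse it into an element tree.
html = urllib.request.urlopen("http://www.afl.com.au/ladder").read()
tree = lxml.html.parse(io.BytesIO(html))

# Skip the header row, then print position and club abbreviation per row.
for row in tree.xpath("//tr")[1:]:
    print(
        first(row, ".//td[1]/span/text()"),
        first(row, ".//abbr/text()"))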

$ python3 ladder.py
1 FRE
2 WCE
3 HAW
4 SYD
5 RICH
6 WB
7 ADEL
8 NMFC
9 PORT
10 GEEL
11 GWS
12 COLL
13 MELB
14 STK
15 ESS
16 GCFC
17 BL
18 CARL

Someone with better knowledge of XPath could probably avoid some of the 
postprocessing I do in Python.
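
As a rough sketch of what that might look like: the XPath function
normalize-space() returns a single, already stripped string, and the
predicate tr[td] selects only rows that contain data cells. The markup in
sample_html is made up to mimic the structure the expressions above rely on
(it is not copied from the real page), and ladder_rows() is a hypothetical
helper name.

```python
import lxml.html

# Hypothetical markup: a header row plus two data rows.
sample_html = """
<table>
  <tr><th>Pos</th><th>Club</th></tr>
  <tr><td><span> 1 </span></td><td><abbr title="Fremantle">FRE</abbr></td></tr>
  <tr><td><span> 11 </span></td><td><abbr title="GWS Giants">GWS</abbr></td></tr>
</table>
""".strip()

tree = lxml.html.fromstring(sample_html)


def ladder_rows(tree):
    """Return (position, abbreviation) string pairs for every data row."""
    rows = []
    # tr[td] matches only rows with <td> cells, so the header row is
    # excluded without slicing the result list.
    for row in tree.xpath("//tr[td]"):
        # normalize-space() converts the first matching node to a string
        # and trims surrounding whitespace, so no [0] or .strip() is needed.
        pos = row.xpath("normalize-space(td[1]/span)")
        club = row.xpath("normalize-space(.//abbr)")
        rows.append((pos, club))
    return rows


for pos, club in ladder_rows(tree):
    print(pos, club)
```

One difference to keep in mind: normalize-space() yields an empty string when
nothing matches, whereas indexing with [0] raises an IndexError, so missing
cells fail silently here.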