Trying to parse matchup.io (lxml, SGMLParser, urlparse)

Sun Jan 18 13:10:05 EST 2015

Jerry Rocteur wrote:

> Hi,
> 
> I'm trying to parse https://matchup.io/players/rocteur/friends
> 
> The body source I'm interested in contains blocks exactly like this
> 
> <tr class='friend'>
> <td class='text--left'>
> <a href="/players/mizucci0"><img alt="mizucci0" class="media__avatar"
> src="https://matchup-io.s3.amazonaws.com/uploads/player/avatar/7651/7651_profile_150_square.jpeg"
> />
> <div class='friend__info'>
> <span>mizucci0</span>
> <span>Mizuho</span>
> </div>
> </a></td>
> <td class='delta-alt'>
> 29,646
> <br>
> steps
> </td>
> <td class='delta-alt'>
> 35,315
> <br>
> steps
> </td>
> <td class='delta-alt'>
> 818.7
> <br>
> Miles
> </td>
> </tr>
> 
> I wanted to do it in Python as I'm learning and I looked at the different
> modules but it isn't easy for me to work out the best way to do this
> as most tutorials I see use complicated classes and I just want to
> parse this one paragraph at a time (as I would do in Perl) and print
> 
> 1 mizuho 26648 35315
> 2 xxxxxx  99999 99999
> 3 xxxxxx 99999 99999
> 
> etc. (in the above case I'm ignoring 818.7 and Miles).
> 
> The best way I found so far is this:
> 
> from lxml import html
> import requests
> page = requests.get("https://matchup.io/players/rocteur/friends/week/")
> tree = html.fromstring(page.text)
> a = tree.xpath('//span/text()')
> b = tree.xpath('//td/text()')
> 
> And then manipulating indices
> 
> e.g.
> print "%s %s %s %s" % (a[usern], a[users], b[tots], b[weekb])
>     tots += 4
>     weekb += 4
>     usern += 2
>     users += 2
> 
> But it isn't very scientific ;-)

In my experience scraping data from a web page never is. The trick is to not 
waste too much time on your script once you have it working. The next 
overhaul of the scraped page is already on the way, and yes, it will use 
JavaScript heavily ;)

> Which module would you use and how would you suggest is the best way to do
> it ?

I think lxml is a good choice. Is there a Perl module whose API you would 
prefer?

> Thanks very much in advance, I haven't done a lot of HTML parsing.. I
> would much prefer using WebServices and an API but unfortunately they
> don't have it.

PS: Here's my take:

import requests
import lxml.html

def get_html():
    # Fetch the page and return its HTML source as text.
    return requests.get(
        "https://matchup.io/players/rocteur/friends/week/").text

def fix(value):
    return value.text.strip().replace(",", "")

tree = lxml.html.fromstring(get_html())
for friend in tree.xpath('//tr[@class="friend"]'):
    values = friend.xpath('.//td[@class="delta-alt"]')
    print(
        friend.xpath('.//div/span[2]/text()')[0],
        fix(values[0]),
        fix(values[1])
    )
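
To try the same approach without hitting the site, here is a minimal sketch
that runs against one row copied from the question (avatar URL trimmed) and
adds the running number asked for in the desired output. The names SAMPLE,
to_int and parse_friends are made up for this illustration. It assumes the
real page keeps the same class names, which may not hold after the next
redesign.

```python
import lxml.html

# One row copied from the question (avatar URL trimmed); hypothetical stand-in
# for the real page, which would contain many such rows.
SAMPLE = """
<table>
<tr class='friend'>
<td class='text--left'>
<a href="/players/mizucci0"><img alt="mizucci0" class="media__avatar" />
<div class='friend__info'>
<span>mizucci0</span>
<span>Mizuho</span>
</div>
</a></td>
<td class='delta-alt'>
29,646
<br>
steps
</td>
<td class='delta-alt'>
35,315
<br>
steps
</td>
<td class='delta-alt'>
818.7
<br>
Miles
</td>
</tr>
</table>
"""


def to_int(cell):
    # cell.text is only the text before the first child element (<br>),
    # so the "steps" label after the <br> is not included.
    return int(cell.text.strip().replace(",", ""))


def parse_friends(html_text):
    # Return a list of (number, name, first_count, second_count) tuples.
    tree = lxml.html.fromstring(html_text)
    rows = []
    for number, friend in enumerate(tree.xpath('//tr[@class="friend"]'), start=1):
        cells = friend.xpath('.//td[@class="delta-alt"]')
        name = friend.xpath('.//div/span[2]/text()')[0]
        rows.append((number, name, to_int(cells[0]), to_int(cells[1])))
    return rows


for row in parse_friends(SAMPLE):
    print(*row)  # prints: 1 Mizuho 29646 35315
```

Selecting each tr first and then querying relative to it (the leading "." in
the XPath) keeps name and numbers of one friend together, so no index
arithmetic across two separate lists is needed.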