Trying to parse matchup.io (lxml, SGMLParser, urlparse)

Jerry Rocteur jerry.rocteur at gmail.com
Sun Jan 18 07:07:37 EST 2015


Hi,

I'm trying to parse https://matchup.io/players/rocteur/friends

The part of the page source I'm interested in contains blocks exactly like this:

<tr class='friend'>
<td class='text--left'>
<a href="/players/mizucci0"><img alt="mizucci0" class="media__avatar"
src="https://matchup-io.s3.amazonaws.com/uploads/player/avatar/7651/7651_profile_150_square.jpeg"
/>
<div class='friend__info'>
<span>mizucci0</span>
<span>Mizuho</span>
</div>
</a></td>
<td class='delta-alt'>
29,646
<br>
steps
</td>
<td class='delta-alt'>
35,315
<br>
steps
</td>
<td class='delta-alt'>
818.7
<br>
Miles
</td>
</tr>

I wanted to do it in Python as I'm learning, and I looked at the
different modules, but it isn't easy for me to work out the best way to
do this: most tutorials I see use complicated classes, and I just want
to parse this one paragraph at a time (as I would do in Perl) and print

1 mizuho 29646 35315
2 xxxxxx  99999 99999
3 xxxxxx 99999 99999

etc. (in the above case I'm ignoring 818.7 and Miles).
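
For comparison, the paragraph-at-a-time approach I would take in Perl
could be sketched in Python with the standard re module. This is only a
rough sketch: it assumes the markup always looks exactly like the block
above (same quoting, same class names), SAMPLE is a trimmed copy of that
block with the avatar image left out, and friend_lines is a name made up
for this example. Regular expressions on HTML are brittle, so it would
break as soon as the site changes its markup.

```python
import re

# Trimmed copy of the block quoted above (avatar <img> omitted).
SAMPLE = """
<tr class='friend'>
<td class='text--left'>
<a href="/players/mizucci0">
<div class='friend__info'>
<span>mizucci0</span>
<span>Mizuho</span>
</div>
</a></td>
<td class='delta-alt'>
29,646
<br>
steps
</td>
<td class='delta-alt'>
35,315
<br>
steps
</td>
<td class='delta-alt'>
818.7
<br>
Miles
</td>
</tr>
"""

# One match per <tr class='friend'> ... </tr> "paragraph".
ROW_RE = re.compile(r"<tr class='friend'>(.*?)</tr>", re.DOTALL)
SPAN_RE = re.compile(r"<span>(.*?)</span>")
# The number is the first thing inside each delta-alt cell.
COUNT_RE = re.compile(r"<td class='delta-alt'>\s*([\d,.]+)")


def friend_lines(source):
    """Return one 'rank name count count' string per friend row."""
    lines = []
    for rank, row in enumerate(ROW_RE.findall(source), start=1):
        names = SPAN_RE.findall(row)      # [username, display name]
        counts = [c.replace(",", "") for c in COUNT_RE.findall(row)]
        # Only the first two cells are used; the Miles cell is ignored.
        lines.append("%d %s %s %s" % (rank, names[1], counts[0], counts[1]))
    return lines


for line in friend_lines(SAMPLE):
    print(line)
```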

The best way I found so far is this:

from lxml import html
import requests
page = requests.get("https://matchup.io/players/rocteur/friends/week/")
tree = html.fromstring(page.text)
a = tree.xpath('//span/text()')
b = tree.xpath('//td/text()')

And then manipulating indices

e.g.
    # inside a loop over the friends
    print("%s %s %s %s" % (a[usern], a[users], b[tots], b[weekb]))
    tots += 4
    weekb += 4
    usern += 2
    users += 2

But it isn't very scientific ;-)
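
A variant that avoids the index bookkeeping might be to let lxml walk
one row at a time and run relative XPath queries inside each row. The
following is only a sketch under some assumptions: every friend row has
class 'friend', two <span> elements, and the numbers in 'delta-alt'
cells, as in the block above. SAMPLE is a trimmed copy of that block
(wrapped in a <table>, avatar image left out), and parse_friends is a
hypothetical helper name, not something from lxml.

```python
from lxml import html

# Trimmed copy of the block quoted above, wrapped in a <table>.
SAMPLE = """
<table>
<tr class='friend'>
<td class='text--left'>
<a href="/players/mizucci0">
<div class='friend__info'>
<span>mizucci0</span>
<span>Mizuho</span>
</div>
</a></td>
<td class='delta-alt'>
29,646
<br>
steps
</td>
<td class='delta-alt'>
35,315
<br>
steps
</td>
<td class='delta-alt'>
818.7
<br>
Miles
</td>
</tr>
</table>
"""


def parse_friends(source):
    """Return (username, name, first_count, second_count) per friend row."""
    tree = html.fromstring(source)
    friends = []
    for row in tree.xpath("//tr[@class='friend']"):
        # Relative queries (leading '.') only look inside this row.
        username, name = [s.strip() for s in row.xpath(".//span/text()")][:2]
        # td.text is the text before the <br>, i.e. the number itself.
        cells = [(td.text or "").strip()
                 for td in row.xpath("./td[@class='delta-alt']")]
        first, second = [int(c.replace(",", "")) for c in cells[:2]]
        friends.append((username, name, first, second))
    return friends


for rank, (username, name, first, second) in enumerate(parse_friends(SAMPLE), 1):
    print("%d %s %d %d" % (rank, name, first, second))
```

With the real page, page.text from requests.get would take the place of
SAMPLE.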

Which module would you use, and what would you suggest is the best way to do it?

Thanks very much in advance. I haven't done a lot of HTML parsing. I
would much prefer using web services or an API, but unfortunately they
don't have one.
-- 
Jerry Rocteur


