[Tutor] Problem using lxml

Martin A. Brown martin at linux-ip.net
Sat Aug 22 23:20:23 CEST 2015


Hi there Anthony,

> I'm pretty new to lxml but I pretty much thought I'd understood 
> the basics. However, for some reason, my first attempt at using it 
> is failing miserably.
>
> Here's the deal:
>
> I'm parsing a specific page on Craigslist (
> http://joplin.craigslist.org/search/rea) and trying to retrieve the text of
> each link on that page. When I do an "inspect element" in Firefox, a sample
> anchor link looks like this:
>
> <a href="/reb/5185592209.html" data-id="5185592209" class="hdrlnk">FIRST
> OPEN HOUSE TOMORROW 2:00pm-4:00pm!!! (8-23-15)</a>
>
> The code I'm using to try to get the link text is this:
>
> from lxml import html
> import requests
>
> page = requests.get("http://joplin.craigslist.org/search/rea")

You are missing a step here: something that takes page.content, 
parses it, and creates a variable called tree (with lxml, typically 
html.fromstring(page.content)).
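
A minimal sketch of that missing step, written so it runs without 
network access.  The name sample_content is hypothetical: it stands 
in for page.content and holds only the one anchor quoted above, not 
the real Craigslist page.

```python
from lxml import html

# Hypothetical stand-in for page.content (bytes), built from the
# sample anchor quoted in this message.
sample_content = b"""<html><body>
<a href="/reb/5185592209.html" data-id="5185592209" class="hdrlnk">FIRST OPEN HOUSE TOMORROW 2:00pm-4:00pm!!! (8-23-15)</a>
</body></html>"""

# The missing step: parse the raw bytes into an element tree.
tree = html.fromstring(sample_content)

# Quick sanity check that parsing produced something queryable.
print(tree.tag)
print(len(tree.xpath('//a')))
```

In the real script the same call would take page.content in place of 
sample_content.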

> titles = tree.xpath('//a[@title="hdrlnk"]/text()')

And your xpath is incorrect.  Play with this in an interactive 
interpreter session and you will be able to correct your xpath.  You 
will notice from the example anchor link above that the attribute 
holding the value "hdrlnk" on the <a> elements you want to grab is 
"class", not "title".  An expression that tests @title cannot match 
an anchor like that one.  Therefore:

   titles = tree.xpath('//a[@class="hdrlnk"]/text()')

Is probably closer.
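
To see the difference between the two expressions, here is a small 
sketch that runs both against the sample anchor from the question.  
The variable names (sample, by_title, by_class) are made up for 
illustration, and the markup is just that one anchor, so the live 
page may behave differently.

```python
from lxml import html

# Hypothetical one-anchor document built from the sample in the question.
sample = (
    b'<html><body>'
    b'<a href="/reb/5185592209.html" data-id="5185592209" class="hdrlnk">'
    b'FIRST OPEN HOUSE TOMORROW 2:00pm-4:00pm!!! (8-23-15)</a>'
    b'</body></html>'
)
tree = html.fromstring(sample)

# Original expression: tests an attribute the anchor does not have.
by_title = tree.xpath('//a[@title="hdrlnk"]/text()')

# Corrected expression: tests the attribute that actually holds "hdrlnk".
by_class = tree.xpath('//a[@class="hdrlnk"]/text()')

print(by_title)
print(by_class)
```

Note that @class="hdrlnk" is an exact string comparison, so it would 
not match an element whose class attribute lists several classes.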

> print titles
>
> The last line, where it supposedly will print the text of each anchor
> returns [].
>
> I can't seem to figure out what I'm doing wrong. lxml seems pretty
> straightforward but I can't seem to get this down.

Again, I'd recommend playing with the data in an interactive console 
session.  You will be able to figure out exactly which xpath gets 
you the data you would like, and then you can drop it into your 
script.
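
One way such an exploratory session might go, sketched as a script: 
list every anchor with its attributes and text, then read off which 
attribute to query.  The markup below is invented for illustration 
(the second anchor in particular is hypothetical), so treat it as a 
pattern and not as what Craigslist serves.

```python
from lxml import html

# Hypothetical markup: the sample anchor plus one unrelated link.
sample = b"""<html><body>
<a href="/reb/5185592209.html" data-id="5185592209" class="hdrlnk">FIRST OPEN HOUSE TOMORROW 2:00pm-4:00pm!!! (8-23-15)</a>
<a href="/about">about</a>
</body></html>"""
tree = html.fromstring(sample)

# Collect each anchor's attributes so they can be inspected.
anchor_attrs = [dict(a.attrib) for a in tree.xpath('//a')]

# Print attributes next to the link text to spot the distinguishing one.
for a in tree.xpath('//a'):
    print(dict(a.attrib), repr(a.text_content()))
```

Typing the same lines at the interpreter prompt gives the same view, 
one step at a time.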

Good luck,

-Martin

-- 
Martin A. Brown
http://linux-ip.net/

