Issues a longer xpath expression

Jean-Michel Pichavant jeanmichel at sequans.com
Fri Feb 22 08:59:02 EST 2013


----- Original Message ----- 

> I am having issues with the urllib and lxml.html modules.
> Here is my original code:
>
> import urllib
> import lxml.html
> down = 'http://v.163.com/special/visualizingdata/'
> file = urllib.urlopen(down).read()
> root = lxml.html.document_fromstring(file)
> xpath_str = "//div[@class='down s-fc3 f-fl']/a"
> urllist = root.xpath(xpath_str)
> for url in urllist:
>     print url.get("href")
>
> When run, it returns this output:
>
> http://mov.bn.netease.com/movieMP4/2012/12/A/7/S8H1TH9A7.mp4
> http://mov.bn.netease.com/movieMP4/2012/12/D/9/S8H1ULCD9.mp4
> http://mov.bn.netease.com/movieMP4/2012/12/4/P/S8H1UUH4P.mp4
> http://mov.bn.netease.com/movieMP4/2012/12/B/V/S8H1V8RBV.mp4
> http://mov.bn.netease.com/movieMP4/2012/12/6/E/S8H1VIF6E.mp4
> http://mov.bn.netease.com/movieMP4/2012/12/B/G/S8H1VQ2BG.mp4
>
> But, when I change the line
> xpath_str='//div[@class="down s-fc3 f-fl"]//a'
> into
> xpath_str='//div[@class="col f-cb"]//div[@class="down s-fc3 f-fl"]//a'
> that is to say,
> urllist = root.xpath('//div[@class="col f-cb"]//div[@class="down s-fc3 f-fl"]//a')
> I do not receive any output. What is the flaw in this code?
> It is so strange that the shorter one works and the longer one does
> not; they have the same xpath structure!

Are you sure this is related to Python? It looks like you just have an issue parsing the XML.

I know little about what you're trying to do, but:

1/ You're shadowing the built-in 'file' type.
2/ Your selector is probably wrong: 'class="col f-cb"' is an exact string comparison, so it can fail because in the document the div class may be written "col f-cb", "col  f-cb" (2 spaces), "f-cb col", etc.
3/ Your short selector returns all matching elements without regard for the parent, hence it is not sensitive to issue 2/.
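
To illustrate point 2/, here is a minimal sketch of exact-string matching versus token matching on a class attribute. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra dependencies, and the markup and the has_classes helper are made up for the example; they are not taken from the page in question. With lxml, the usual XPath 1.0 idiom for the same idea is contains(concat(' ', normalize-space(@class), ' '), ' col '), which I have not tried against that page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: three divs carrying the same two class tokens,
# written three different ways.
SAMPLE = """
<html><body>
  <div class="col f-cb"><a href="a.mp4">a</a></div>
  <div class="col  f-cb"><a href="b.mp4">b</a></div>
  <div class="f-cb col"><a href="c.mp4">c</a></div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# Exact string comparison: only the div written exactly "col f-cb" matches.
exact = root.findall(".//div[@class='col f-cb']")


def has_classes(element, *wanted):
    """Return True if every wanted token appears in the class attribute."""
    tokens = element.get("class", "").split()
    return all(name in tokens for name in wanted)


# Token comparison: spacing and order of the class names no longer matter.
by_token = [div for div in root.iter("div") if has_classes(div, "col", "f-cb")]

print(len(exact), len(by_token))
```

The exact comparison finds one div, the token comparison finds all three, which is the kind of difference that could explain an empty result from the longer selector.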

How to get all .mp4 links:


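# Collect every <a> element that has an href attribute.
hrefList = root.xpath('//a[@href]')
# Keep only the ones whose href contains '.mp4'.
mp4List = [ref for ref in hrefList if '.mp4' in ref.attrib.get('href', '')]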

mp4List
[<Element a at 8d7ee0c>,
 <Element a at 8d7eefc>,
 <Element a at 8d7ee6c>,
 <Element a at 8d7ed7c>,
 <Element a at 8d7ef8c>,
 <Element a at 8d7efbc>]

From this list you can access parent and child information.

for mp4 in mp4List:
    print(mp4.get('href'))

http://mov.bn.netease.com/movieMP4/2012/12/A/7/S8H1TH9A7.mp4
http://mov.bn.netease.com/movieMP4/2012/12/D/9/S8H1ULCD9.mp4
http://mov.bn.netease.com/movieMP4/2012/12/4/P/S8H1UUH4P.mp4
http://mov.bn.netease.com/movieMP4/2012/12/B/V/S8H1V8RBV.mp4
http://mov.bn.netease.com/movieMP4/2012/12/6/E/S8H1VIF6E.mp4
http://mov.bn.netease.com/movieMP4/2012/12/B/G/S8H1VQ2BG.mp4
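
As a rough sketch of the parent access mentioned above: lxml elements provide getparent() directly, but the example below sticks to the standard library's xml.etree.ElementTree, which has no such method, so it builds a child-to-parent map first. The markup, the example.invalid URLs and the parent_of name are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup loosely modelled on the structure discussed above.
SAMPLE = """
<div class="col f-cb">
  <div class="down s-fc3 f-fl"><a href="http://example.invalid/one.mp4">one</a></div>
  <div class="other"><a href="http://example.invalid/page.html">page</a></div>
</div>
""".strip()

root = ET.fromstring(SAMPLE)

# ElementTree elements do not know their parent, so map each child to it.
parent_of = {child: parent for parent in root.iter() for child in parent}

# Same filter as above: keep <a> elements whose href contains '.mp4'.
mp4_links = [a for a in root.iter("a") if ".mp4" in a.get("href", "")]

for a in mp4_links:
    parent = parent_of[a]
    print(a.get("href"), "is inside a div with class", repr(parent.get("class")))
```

Checking the class of the parent this way is one option for seeing what the enclosing div really looks like before writing a longer selector.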

cheers

JM

