[Tutor] list all links with certain extension in an html file python

Stefan Behnel stefan_ml at behnel.de
Fri Sep 28 15:59:12 CEST 2012


Santosh Kumar, 16.09.2012 09:20:
> I want to extract (no I don't want to download) all links that end in
> a certain extension.
> 
> Suppose there is a webpage, and in the head of that webpage there are
> 4 different CSS files linked to external server. Let the head look
> like this:
> 
>     <link rel="stylesheet" type="text/css" href="http://foo.bar/part1.css">
>     <link rel="stylesheet" type="text/css" href="http://foo.bar/part2.css">
>     <link rel="stylesheet" type="text/css" href="http://foo.bar/part3.css">
>     <link rel="stylesheet" type="text/css" href="http://foo.bar/part4.css">
> 
> Please note that I don't want to download those CSS, instead I want
> something like this (to stdout):
> 
>     http://foo.bar/part1.css
>     http://foo.bar/part2.css
>     http://foo.bar/part3.css
>     http://foo.bar/part4.css
> 
> Also I don't want to use external libraries.

That's too bad because lxml.html would make this really easy. See the
iterlinks() method here:

http://lxml.de/lxmlhtml.html#working-with-links
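
As a rough sketch of what that looks like (assuming lxml is installed; the
sample markup and the helper name links_with_extension are made up for
illustration), iterlinks() yields (element, attribute, link, pos) tuples,
so filtering by extension is a one-line comprehension:

```python
import lxml.html

# Hypothetical sample page, modelled on the markup in the question.
HTML = """\
<html><head>
<link rel="stylesheet" type="text/css" href="http://foo.bar/part1.css">
<link rel="stylesheet" type="text/css" href="http://foo.bar/part2.css">
<script src="http://foo.bar/app.js"></script>
</head><body><a href="http://foo.bar/index.html">home</a></body></html>
"""


def links_with_extension(html_text, extension):
    """Return every link in html_text whose URL ends with extension."""
    root = lxml.html.fromstring(html_text)
    # iterlinks() yields (element, attribute, link, pos) for each link found.
    return [link for element, attribute, link, pos in root.iterlinks()
            if link.endswith(extension)]


css_links = links_with_extension(HTML, ".css")
for link in css_links:
    print(link)
```

Nothing is downloaded here; the URLs are only read out of the parsed tree.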

Note that this also handles links in embedded CSS code etc., although you
might not be interested in that if the example above is representative of
your task.
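
If external libraries really are ruled out, the standard library's
html.parser module (Python 3; the module was called HTMLParser in Python 2)
can do a rougher version of the same job. This is only a minimal sketch:
the class name LinkCollector and the choice to look at href and src
attributes are my assumptions, and unlike iterlinks() it does not look
inside embedded CSS.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href/src attribute values that end with a given extension."""

    def __init__(self, extension):
        super().__init__()
        self.extension = extension
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for
        # attributes that have no value.
        for name, value in attrs:
            if name in ("href", "src") and value and value.endswith(self.extension):
                self.links.append(value)


# Sample markup taken from the question.
HEAD = """\
<link rel="stylesheet" type="text/css" href="http://foo.bar/part1.css">
<link rel="stylesheet" type="text/css" href="http://foo.bar/part2.css">
<link rel="stylesheet" type="text/css" href="http://foo.bar/part3.css">
<link rel="stylesheet" type="text/css" href="http://foo.bar/part4.css">
"""

collector = LinkCollector(".css")
collector.feed(HEAD)
for link in collector.links:
    print(link)
```

A plain endswith() test misses URLs that carry a query string, such as
style.css?v=2, so real pages may need the URL split with urllib.parse first.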

Stefan



