[Tutor] list all links with certain extension in an html file python

Fri Sep 28 15:59:12 CEST 2012

Santosh Kumar, 16.09.2012 09:20:
> I want to extract (no I don't want to download) all links that end in
> a certain extension.
> 
> Suppose there is a webpage, and in the head of that webpage there are
> 4 different CSS files linked from an external server. Let the head look
> like this:
> 
>     <link rel="stylesheet" type="text/css" href="http://foo.bar/part1.css">
>     <link rel="stylesheet" type="text/css" href="http://foo.bar/part2.css">
>     <link rel="stylesheet" type="text/css" href="http://foo.bar/part3.css">
>     <link rel="stylesheet" type="text/css" href="http://foo.bar/part4.css">
> 
> Please note that I don't want to download those CSS, instead I want
> something like this (to stdout):
> 
>     http://foo.bar/part1.css
>     http://foo.bar/part2.css
>     http://foo.bar/part3.css
>     http://foo.bar/part4.css
> 
> Also I don't want to use external libraries.

That's too bad because lxml.html would make this really easy. See the
iterlinks() method here:

http://lxml.de/lxmlhtml.html#working-with-links

Note that this also handles links in embedded CSS code etc., although you
might not be interested in that if the example above is representative of
your task.
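
If external libraries really are out of the question, the standard
library's html.parser module can probably do the job with a little more
code. The following is only a minimal sketch, assuming Python 3 (in
Python 2 the module is named HTMLParser). The names LinkExtractor and
links_with_extension are made up for illustration, and the sketch only
looks at href and src attributes:

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href/src attribute values that end in a given extension.

    Hypothetical helper class, not part of the standard library.
    """

    def __init__(self, extension):
        super().__init__()
        self.extension = extension
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for
        # attributes written without a value, so skip those.
        for name, value in attrs:
            if name in ("href", "src") and value and value.endswith(self.extension):
                self.links.append(value)


def links_with_extension(html_text, extension):
    """Return all matching links in document order (nothing is downloaded)."""
    parser = LinkExtractor(extension)
    parser.feed(html_text)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = """
    <head>
      <link rel="stylesheet" type="text/css" href="http://foo.bar/part1.css">
      <link rel="stylesheet" type="text/css" href="http://foo.bar/part2.css">
    </head>
    """
    for link in links_with_extension(sample, ".css"):
        print(link)
```

Unlike iterlinks(), this does not look inside embedded CSS code, and a
plain endswith() check will miss URLs that carry a query string after the
extension.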

Stefan