[Tutor] list all links with certain extension in an html file python
Stefan Behnel
stefan_ml at behnel.de
Fri Sep 28 15:59:12 CEST 2012
Santosh Kumar, 16.09.2012 09:20:
> I want to extract (no I don't want to download) all links that end in
> a certain extension.
>
> Suppose there is a webpage, and in the head of that webpage there are
> 4 different CSS files linked to external server. Let the head look
> like this:
>
> <link rel="stylesheet" type="text/css" href="http://foo.bar/part1.css">
> <link rel="stylesheet" type="text/css" href="http://foo.bar/part2.css">
> <link rel="stylesheet" type="text/css" href="http://foo.bar/part3.css">
> <link rel="stylesheet" type="text/css" href="http://foo.bar/part4.css">
>
> Please note that I don't want to download those CSS, instead I want
> something like this (to stdout):
>
> http://foo.bar/part1.css
> http://foo.bar/part2.css
> http://foo.bar/part3.css
> http://foo.bar/part4.css
>
> Also I don't want to use external libraries.
That's too bad because lxml.html would make this really easy. See the
iterlinks() method here:
http://lxml.de/lxmlhtml.html#working-with-links
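
As a rough sketch of how that could look (the function name and the
sample page are made up for illustration, and checking the URL path via
urlparse is just one possible way to test the extension):

```python
from urllib.parse import urlparse

import lxml.html


def links_with_extension(html_text, extension):
    """Return every link in html_text whose URL path ends with extension."""
    doc = lxml.html.fromstring(html_text)
    # iterlinks() yields (element, attribute, link, pos) tuples.
    return [
        link
        for _element, _attribute, link, _pos in doc.iterlinks()
        if urlparse(link).path.endswith(extension)
    ]


# Hypothetical sample page, modelled on the head section in the question.
SAMPLE = """<html><head>
<link rel="stylesheet" type="text/css" href="http://foo.bar/part1.css">
<link rel="stylesheet" type="text/css" href="http://foo.bar/part2.css">
<script src="http://foo.bar/app.js"></script>
</head><body><a href="http://foo.bar/index.html">home</a></body></html>"""

if __name__ == "__main__":
    for url in links_with_extension(SAMPLE, ".css"):
        print(url)
```

Nothing is downloaded here, the URLs are only read out of the parsed
document.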
Note that iterlinks() also handles links in embedded CSS code etc., although
you might not be interested in that if the example above is representative
of your task.
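
If external libraries really are off the table, the standard library's
html.parser can probably cover the simple case. The following is only a
minimal sketch: the class and function names are invented for
illustration, it looks only at href and src attributes, and it does not
look inside embedded CSS.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect href/src values whose URL path ends with a given extension."""

    def __init__(self, extension):
        super().__init__()
        self.extension = extension
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        for name, value in attrs:
            if name in ("href", "src") and value:
                if urlparse(value).path.endswith(self.extension):
                    self.links.append(value)


def find_links(html_text, extension):
    """Parse html_text and return the matching links in document order."""
    collector = LinkCollector(extension)
    collector.feed(html_text)
    collector.close()
    return collector.links


if __name__ == "__main__":
    # Hypothetical input, shaped like the head section in the question.
    page = (
        '<link rel="stylesheet" type="text/css" href="http://foo.bar/part1.css">'
        '<link rel="stylesheet" type="text/css" href="http://foo.bar/part2.css">'
    )
    for url in find_links(page, ".css"):
        print(url)
```

Comparing against the path component rather than the whole URL means a
link with a query string, such as part2.css?v=2, still counts as a .css
link.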
Stefan