[Tutor] list all links with certain extension in an html file python

Oscar Benjamin oscar.j.benjamin at gmail.com
Fri Sep 28 14:09:19 CEST 2012


On 16 September 2012 08:20, Santosh Kumar <sntshkmr60 at gmail.com> wrote:

> I want to extract (no I don't want to download) all links that end in
> a certain extension.
>
> Suppose there is a webpage, and in the head of that webpage there are
> 4 different CSS files linked to an external server. Let the head look
> like this:
>
>     <link rel="stylesheet" type="text/css" href="http://foo.bar/part1.css">
>     <link rel="stylesheet" type="text/css" href="http://foo.bar/part2.css">
>     <link rel="stylesheet" type="text/css" href="http://foo.bar/part3.css">
>     <link rel="stylesheet" type="text/css" href="http://foo.bar/part4.css">
>
> Please note that I don't want to download those CSS, instead I want
> something like this (to stdout):
>
>     http://foo.bar/part1.css
>     http://foo.bar/part2.css
>     http://foo.bar/part3.css
>     http://foo.bar/part4.css
>
> Also I don't want to use external libraries. I am asking for: which
> libraries and functions should I use?
>

If you don't want to use any third-party libraries, the standard
library has the urllib2 module for downloading an HTML file and the
HTMLParser module for parsing it (in Python 3 these are urllib.request
and html.parser):
http://docs.python.org/library/urllib2.html#examples
http://docs.python.org/library/htmlparser.html#example-html-parser-application
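
As a rough illustration of how those two modules could fit together,
here is a minimal sketch. It is written for Python 3, so it imports
html.parser and urllib.request rather than the Python 2 names. The
names LinkCollector, links_with_extension and fetch_html are made up
for this example, and checking both href and src attributes is my own
assumption about what counts as a "link".

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect href/src values ending in an extension."""

    def __init__(self, extension):
        super().__init__()
        self.extension = extension.lower()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        for name, value in attrs:
            if name in ("href", "src") and value:
                url = value.strip()  # tolerate stray whitespace or newlines
                if url.lower().endswith(self.extension):
                    self.links.append(url)


def links_with_extension(html_text, extension):
    """Return every matching link in html_text, in document order."""
    parser = LinkCollector(extension)
    parser.feed(html_text)
    parser.close()
    return parser.links


def fetch_html(url):
    """Download a page as text (assumes UTF-8; not called in the demo below)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    sample = (
        '<head>'
        '<link rel="stylesheet" type="text/css" href="http://foo.bar/part1.css">'
        '<link rel="stylesheet" type="text/css" href="http://foo.bar/part2.css">'
        '</head>'
    )
    for link in links_with_extension(sample, ".css"):
        print(link)
```

For a real page you would pass fetch_html(some_url) to
links_with_extension instead of the sample string. The match is a plain
suffix test, so a URL with a query string such as style.css?v=2 would
be missed unless you strip the query first.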

Oscar