[Tutor] Re: Re trouble

Mon Oct 27 17:15:29 EST 2003

Øyvind Dale Spørck wrote on Mon, 27 Oct 2003 20:42:46 +0100:

> Hello,
> 
>    I am using the re module to filter some web addresses out of HTML
> documents, but cannot seem to get it right. What should go in the parentheses
> of the re.search?
> 
> Here is an example from the html:
> 
> <a
> href="../../../../../../get.liste.kvakk.no/fs/http_3A/www.db.no/smurf/defaul
> t.htm"><b>Dagbladet AS</b></a> &#91;<a
> href="../../../../../../get.liste.kvakk.no/is/http_3A/testside.no/smurf/defa
> ult.htm"><font color="#CC3300"><b>Vis side</b></font></a>
> 
> In other words, I would like to get a list of these addresses:
> www.db.no/smurf/default.htm
> testside.no/smurf/default.htm
> 
> These addresses can be anything. I guess the common denominator is that they
> start after http_3A/ and end before the first ".

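Here's a regex that can do that (the group with index 1 of each match
contains the address):

  http_3A/([^"]*)"

The [^"]* stops at the first closing quote, whereas .* is greedy and would
run on to the last quote on a line that holds more than one link. Use
re.findall rather than re.search if you want a list of all the addresses.

It won't get you very far if your requirements change, even slightly.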
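
As a rough sketch of how this could look in Python: the snippet below assumes
the HTML has already been read into a string with each link on a single line
(the sample is the one from the question, with the mail wrapping undone). The
names html, pattern, addresses and greedy_capture are only illustrative.

```python
import re

# Sample HTML from the question, rejoined so that each href sits on one line
# (an assumption; the mail client wrapped the original).
html = (
    '<a href="../../../../../../get.liste.kvakk.no/fs/http_3A/www.db.no/smurf/default.htm">'
    '<b>Dagbladet AS</b></a> &#91;'
    '<a href="../../../../../../get.liste.kvakk.no/is/http_3A/testside.no/smurf/default.htm">'
    '<font color="#CC3300"><b>Vis side</b></font></a>'
)

# Capture everything after "http_3A/" up to, but not including, the next
# double quote.
pattern = re.compile(r'http_3A/([^"]*)"')

# With exactly one group in the pattern, findall returns a list holding the
# text of that group for every match.
addresses = pattern.findall(html)
print(addresses)

# For comparison: a greedy (.*) keeps going to the last double quote in the
# string, so it swallows the markup between the two links.
greedy_capture = re.search(r'http_3A/(.*)"', html).group(1)
print(greedy_capture)
```

The first print should show just the two addresses; the second shows why the
greedy form is risky when several links share a line.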

-- 
Yours,

Andrei
