better regular expression?

Andy Gross andy at andygross.org
Mon Dec 6 19:51:46 EST 2004


Check out the 'urlparse' module, in the standard library, unless for  
some reason you *have* to use regular expressions.
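
A minimal sketch of that approach, assuming Python 3, where the old
'urlparse' module lives in urllib.parse. The names hostname and sInfo
(written s_info here) come from the original post; HREF_RE, is_local and
add_info are hypothetical helpers made up for illustration. A single simple
regex only locates the href value, and the parsed URL decides whether it is
on the host, so short relative URLs need no special handling.

```python
import re
from urllib.parse import urlsplit  # Python 3 location of the old 'urlparse' module

# Only finds the href value; the local/remote decision is made in is_local().
HREF_RE = re.compile(r'(href=")([^"]*)(")')


def is_local(url, hostname):
    """Return True for relative URLs, root-absolute URLs such as "/",
    and http(s) URLs whose host equals hostname."""
    parts = urlsplit(url)
    if not parts.scheme and not parts.netloc:
        # No scheme and no host: relative ("a.html") or root-absolute ("/").
        return True
    # parts.hostname is lower-cased and excludes any port number.
    return parts.scheme in ("http", "https") and parts.hostname == hostname.lower()


def add_info(line, hostname, s_info):
    """Append s_info to every href in line that points at hostname."""
    def repl(match):
        url = match.group(2)
        if is_local(url, hostname):
            url += s_info
        return match.group(1) + url + match.group(3)

    return HREF_RE.sub(repl, line)
```

With this sketch, add_info(line, hostname, sInfo) would take the place of
the re.sub call in the original message. Note that scheme-relative URLs
such as "//host/path" are treated as not local here; that is a choice made
for this example, not something the original post specifies.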

/arg

On Dec 6, 2004, at 7:46 PM, Vivek wrote:

> Hi,
>
> I am trying to construct a regular expression, using the re module, that
> matches
> 1. URLs on my hostname
> 2. URLs that are absolute from the root, including just "/"
> 3. relative URLs.
>
> Basically I want the pattern not to match URLs that are not on my
> host.
>
> The following statement satisfies numbers 1 and 2, but not 3:
>
> line = re.sub(r'(href=")(http[s]?://'+hostname+'[/]?|/)([^"]*?)(")',
>               r'\1\2\3'+sInfo+r'\4', line)
>
> An improvement that also partially satisfies number 3 is
>
> line = re.sub(r'(href=")(http[s]?://'+hostname+'[/]?|/|[^h][^t][^t][^p][^:][^/][^/])([^"]*?)(")',
>               r'\1\2\3'+sInfo+r'\4', line)
>
> This is not complete because if the relative URL is shorter than seven
> characters, then it will not match.
>
> Any suggestions?
>
> Thanx.
>



