[Doc-SIG] What counts as a url?

M.-A. Lemburg mal@lemburg.com
Fri, 16 Mar 2001 18:08:37 +0100


"Edward D. Loper" wrote:
> 
> So I'm working on adding HREFs to STminus.  They look like this::
> 
>     "anchor name":URL
> 
> Where URL is either a relative URL or an absolute URL.  So I went
> and looked up "RFC 2396":http://www.w3.org/Addressing/rfc2396.txt .
> It suggests (if I'm reading it correctly) that we could define
> a URL as::
> 
>     ([a-zA-Z0-9-_.!~*'();/?:@&=+$,#] | %[0-9a-fA-F][0-9a-fA-F])+
> 
> Should we use that regexp for URLs?  Or perhaps we should go for
> simplicity, and say that the regexp ends at whitespace::
> 
>     [^\s]+
> 
> In either case, we'll have to be careful to say::
> 
>     See "this":http://url .
> 
> instead of::
> 
>     See "this":http://url.
> 
> (the '.' gets included in the second url).  Is that a problem?  If
> so, what can we do about it? (Keep in mind that it *is* acceptable
> to have a URL that ends in a '.').
> 
> Of course, I don't think people will be including HREFs in their
> documentation much, anyway.  So the main issue for most people
> will just be that they can't use '":' in certain environments.
> 
> Ideas/thoughts?
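
To make the trailing '.' issue concrete, here is a minimal Python
sketch of the two candidate patterns quoted above.  It is only an
illustration: the names find_hrefs, RFC2396_URL and NON_SPACE_URL are
made up for this example and are not part of STminus, and the '-' in
the first character class is escaped so it is clearly a literal.

```python
import re

# The two candidate URL patterns from the quoted message (names are
# hypothetical).  The '-' is escaped so it cannot be read as a range.
RFC2396_URL = r"(?:[a-zA-Z0-9\-_.!~*'();/?:@&=+$,#]|%[0-9a-fA-F][0-9a-fA-F])+"
NON_SPACE_URL = r"[^\s]+"


def find_hrefs(text, url_pattern):
    """Return (anchor, url) pairs for each "anchor":URL construct in text."""
    href = re.compile(r'"([^"]+)":(' + url_pattern + r')')
    return href.findall(text)


if __name__ == "__main__":
    for pattern in (RFC2396_URL, NON_SPACE_URL):
        # A space before the '.' keeps it out of the URL ...
        print(find_hrefs('See "this":http://url .', pattern))
        # ... without the space, the '.' becomes part of the URL.
        print(find_hrefs('See "this":http://url.', pattern))
```

Both patterns behave the same way on these two inputs, since '.' is a
legal URL character in either definition.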

FYI, I use this RE in my apps:

	r'\b((?:http|ftp|https|mailto)://[\w@&#-_.!~*();]+\b/?)'

I don't think it makes sense to include schemes which are not
supported by your everyday browser, so only the most common ones
are included.
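
As a rough illustration of how that RE behaves under Python's re
module, here is a small sketch.  The names URL_RE and find_urls are
invented for the example; the pattern itself is copied unchanged from
above.  Two things worth knowing when reading it: the closing \b makes
the match end on a word character (plus an optional '/'), so a
sentence-final '.' is dropped, and '#-_' inside the brackets is read
as a character range, which is what lets '/', ':' and '?' through.

```python
import re

# Pattern copied verbatim from the message; URL_RE and find_urls are
# hypothetical names used only for this sketch.
URL_RE = re.compile(r'\b((?:http|ftp|https|mailto)://[\w@&#-_.!~*();]+\b/?)')


def find_urls(text):
    """Return every substring of text that the pattern accepts as a URL."""
    return URL_RE.findall(text)


if __name__ == "__main__":
    # The trailing \b backs the match off the sentence-final period.
    print(find_urls('See http://url.'))
    # A single trailing slash is kept by the optional '/?'.
    print(find_urls('See http://www.egenix.com/.'))
    # The pattern requires '://', so a plain mailto: address is skipped.
    print(find_urls('Write to mailto:mal@lemburg.com'))
```

The flip side of the \b trick is that a URL which really does end in
'.' loses that last character.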

-- 
Marc-Andre Lemburg
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Pages:                           http://www.lemburg.com/python/