[Doc-SIG] What counts as a url?

Edward Welbourne Edward Welbourne <eddy@chaos.org.uk>
Sat, 17 Mar 2001 17:14:09 +0000 (GMT)


See the standard module urlparse.

What we should really do is add, to the standard module urlparse, a
function which takes a string and returns the length of the initial
segment which is a URL, with a negative value on failure (any genuine
URL is guaranteed to be longer than 3 characters; but negative, or
maybe 0, is the right failure answer for compatibility with existing
string/regex match/find routines), so that the caller can then snip
off this chunk and pass it to urlparse.urlparse() ... erm, hello: the
doc page (at 1.5.2) mentions a `parameters' chunk following a
semicolon.  Never heard of it myself:

This corresponds to the general structure of a URL:

    scheme://netloc/path;parameters?query#fragment

ick - it doesn't cut netloc into name:port, let alone supply the
scheme's default for port when it's omitted.  It also doesn't specify
query-parsing (OK, that ends up making contentious design decisions,
so fair enough) or url-decoding (OK, one can't decode any chunk
without also decoding the query's fragments, which implies parsing
the query first).  Ho hum.
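
To make the complaint concrete, here's a hedged sketch against the
Python 2 urlparse module (urllib.parse in later Pythons); the example
URL and the little default-port table are my own inventions:

    import urlparse   # urllib.parse in later Pythons

    url = 'http://spam.example.com:8080/to/page;type=a?x=1&y=2#frag'
    scheme, netloc, path, params, query, fragment = urlparse.urlparse(url)
    # scheme='http', netloc='spam.example.com:8080', path='/to/page',
    # params='type=a', query='x=1&y=2', fragment='frag'

    # netloc comes back whole; cutting it into name:port (and filling
    # in the scheme's default port) is left to the caller:
    if ':' in netloc:
        name, port = netloc.split(':')
    else:
        name, port = netloc, {'http': '80', 'ftp': '21'}.get(scheme, '')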

>>     ([a-zA-Z0-9-_.!~*'();/?:@&=+$,#] | %[0-9a-fA-F][0-9a-fA-F])+
>	r'\b((?:http|ftp|https|mailto)://[\w@&#-_.!~*();]+\b/?)'
erm ...

I'm fairly sure you're allowed at most one # and at most one ? in a
URL: any others *must* be url-encoded as %[0-9A-Fa-f]{2} tokens.  I'm
fairly sure you aren't allowed an & before the ?, and that the # has
to appear after the ? and after all the &s.

Marc's regex doesn't mention = and ? explicitly, but they're
definitely allowed in URLs.  Are () really allowed in URLs?  How
about {} and []?  I'm fairly sure : and , are allowed in paths.  But
I'd expect :,{}()[]*! all to be url-encoded anyway, so they shouldn't
appear in the regexen: once encoded they're just % plus \w
characters, which the patterns already cover.
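
That first quoted pattern drops into Python's re more or less
verbatim; a sketch (I've shifted the - to the end of the character
class, of which more below, and note it's a deliberately liberal
prefix-matcher - it does nothing to enforce the one-#, one-? rule):

    import re

    # One URL character: a literal from the quoted set, or a %XX escape;
    # anything else (spaces, spare #s and ?s) has to be url-encoded.
    url_char = r"(?:[A-Za-z0-9_.!~*'();/?:@&=+$,#-]|%[0-9A-Fa-f]{2})"
    url_re = re.compile(url_char + "+")

    m = url_re.match("http://example.com/a%20path?x=1,y#frag and more")
    if m:
        print m.group()   # everything up to the space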

There is an RFC for URIs; I mailed it to Edward recently.  I guess
that'd be the one meant by
>> and looked up "RFC 2396":http://www.w3.org/Addressing/rfc2396.txt .
so go read the appendices (pedantically).

I know the relevant RFC has a helpful Appendix A giving BNF and Appendix
B advising how to parse, complete with a regex for parsing (which
presumes you *check* separately, based on the BNF).
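
For the impatient: Appendix B's regex goes straight into re (the
pattern below is the RFC's own, character for character; only the
group-picking line is mine):

    import re

    # RFC 2396, Appendix B: cuts a URI into its parts, but does *no*
    # validity checking - that is Appendix A's job.
    rfc2396_b = re.compile(
        r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

    m = rfc2396_b.match('http://www.w3.org/Addressing/rfc2396.txt')
    scheme, netloc, path, query, fragment = m.group(2, 4, 5, 7, 9)
    # -> 'http', 'www.w3.org', '/Addressing/rfc2396.txt', None, None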

I really don't like that space between the URL and the full-stop
(sorry, `period', to translate into North American Anglic); but, no,
I can't see how to avoid it.  Other than treating the end of a URL as
`this may have been the end of a sentence', even when it isn't
followed by a ., so that authors of doc-strings know they can treat
the URL as a sentence-end (I remain unconvinced).

oh - Marc:
> 	r'\b((?:http|ftp|https|mailto)://[\w@&#-_.!~*();]+\b/?)'
did you really mean `from # to _ inclusive'   ^^^ or did you mean to say
`#, - or _' ?  Hmm, I think you mean the latter: put - last in the [...]
But the latter reading claims you've missed out / in the path, and the
former claims most entries in your [...] are duplicates of ones in the
#-_ range.  I'm confused.

If the - is last, and we mention = explicitly, we can phrase the
character class as [...=;-] with its hair standing on end, which seems
entirely appropriate.
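
A throw-away sketch to make the two readings concrete (the `hairy'
class at the end is illustrative, not a recommendation):

    import re

    # Reading one: a range over ASCII 0x23 (#) .. 0x5F (_), which
    # swallows the digits, the capitals and most of the punctuation
    # already listed - hence the duplicates:
    assert re.match(r'[#-_]', '0') and re.match(r'[#-_]', 'Q')
    assert not re.match(r'[#-_]', 'a')   # lower-case sits above 0x5F

    # Reading two: the three literals #, - and _; for that, the - must
    # sit first or last in the class, hair on end:
    hairy = re.compile(r"[\w@&#_.!~*();'=,-]+")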


<sigh>.  Why do regexes always feel like the right answer to the
wrong question - albethey useful?
Edward: will working in EBNF spare us these messes?
(I'm assuming that's Extended Backus-Naur Form.)
Tibs: would mxTextTools let us say this stuff less uglily ?

I'm inclined to advise running with the way the RFC's appendices
approach the problem, though: first, parse according to Appendix B's
regex; then (it explains this better than I can here) take the
fragments into which it's cut your putative URL text and check each
fragment for validity against the appropriate rules in Appendix A,
which depend on the scheme; if any fail their check, decide that this
wasn't a URL anyway.  Albeit this means fully parsing the URL, so
maybe the right function to add to urlparse is one which reads, from
a string, the longest initial chunk which is a URL, returning a tuple
whose first item is the length and whose remaining items are
urlparse.urlparse()'s answers (at least when the length is positive).
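
A hedged sketch of that combined function (the name readurl and the
liberal candidate pattern are mine, and the Appendix A checking is
only gestured at):

    import re, urlparse   # urllib.parse in later Pythons

    candidate = re.compile(
        r"(?:[A-Za-z0-9_.!~*'();/?:@&=+$,#-]|%[0-9A-Fa-f]{2})+")

    def readurl(text, pos=0):
        """Length of the URL at text[pos:], plus urlparse's six
        answers; the length is -1 (and the rest empty) on failure."""
        m = candidate.match(text, pos)
        if m is None:
            return (-1, '', '', '', '', '', '')
        chunk = m.group()
        # A real version would now check each piece of chunk against
        # the Appendix A grammar for its scheme, trimming or rejecting
        # as needed; this sketch trusts the liberal pattern.
        return (len(chunk),) + tuple(urlparse.urlparse(chunk))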


> I don't think it makes sense to include schemes which are not
> supported by your everyday browser, so only the most common ones
> are included.
I think it does make sense to include them, for two reasons:

  i) we should *recognise that the text is a URL* even when we know not
     what to do with it, if only so we can warn the user - the principle
     of least surprise says that if you *have to* surprise the user
     (whose browser does know about a scheme you're ignoring) you should
     at least have the decency to warn.

 ii) forward compatibility - someone may add a scheme that really does
     deserve to be in there, and the tool should need minimal revision
     to cope.

and I thought most browsers *did* cope with gopher, which you omit ...
Yes, I admit it, I'm an old fuddy-duddy.
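
Recognising-but-warning costs almost nothing, as in this sketch (the
function name and the KNOWN list are mine, and the list is
deliberately short):

    import re, sys

    # RFC 2396: scheme = alpha *( alpha | digit | "+" | "-" | "." )
    scheme_re = re.compile(r'([A-Za-z][A-Za-z0-9+.-]*):')

    KNOWN = ('http', 'https', 'ftp', 'mailto', 'gopher')

    def url_scheme(text):
        """The scheme starting text, or None; warn on unfamiliar ones."""
        m = scheme_re.match(text)
        if m is None:
            return None             # not a URL at all
        scheme = m.group(1).lower()
        if scheme not in KNOWN:
            # recognise it anyway, but have the decency to warn:
            sys.stderr.write("unfamiliar URL scheme: %s\n" % scheme)
        return scheme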

But the right answer is to use the urlparse module, not to ad-hoc
together your own; if you don't like how urlparse does things, fix
it.  (Note: I'm as guilty as anyone on this - I'd written a much
longer version of this e-mail, complete with my own ad-hoc regex,
before even thinking to look for a module, at which point I instantly
*knew* the module was bound to exist - and not be to my liking.)

	Eddy.