What's wrong with this subroutine?

Tim Legant tim-dated-1015305638.c02155 at catseye.net
Tue Feb 26 00:20:37 EST 2002


Daniel Yoo <dyoo at hkn.eecs.berkeley.edu> writes:

> Sounds good!  You might want to use a regular expression to detect
> this "url" pattern.  Here's a translation of Tom Christiansen's (of
> Perl fame) HTTP url regular expression:
> 
> 
> ###
> ## This is a regular expression that detects HTTP urls.
> ##
> ## This is only a small sample of tchrist's very nice tutorial on
> ## regular expressions.  See:
> ##
> ##     http://www.perl.com/doc/FMTEYEWTK/regexps.html
> ##
> ## for more details.
> 
> import re
> 
> urls = '(%s)' % '|'.join("""http telnet gopher file wais ftp""".split())
> ltrs = r'\w'
> gunk = r'/#~:.?+=&%@!\-'
> punc = r'.:?\-'
> any = "%(ltrs)s%(gunk)s%(punc)s" % { 'ltrs' : ltrs,
>                                      'gunk' : gunk,
>                                      'punc' : punc }
> 
> url = r"""
>     \b                            # start at word boundary
>     (                             # begin \1 {
>         %(urls)s    :             # need resource and a colon
>         [%(any)s] +?              # followed by one or more
>                                   #  of any valid character, but
>                                   #  be conservative and take only
>                                   #  what you need to....
>     )                             # end   \1 }
>     (?=                           # look-ahead non-consumptive assertion
>             [%(punc)s]*           # either 0 or more punctuation
>             [^%(any)s]            #  followed by a non-url char
>         |                         # or else
>             $                     #  then end of the string
>     )
>     """ % {'urls' : urls,
>            'any' : any,
>            'punc' : punc }
> 
> url_re = re.compile(url, re.VERBOSE)

Hmmm.  Here's a URL found in "the wild" that this regular expression
doesn't match in full:

http://www.wired.com/news/medtech/0,1286,50394,00.html

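I don't know whether commas are supposed to be encoded or not, but
they're fairly common on news sites, and Christiansen's RE breaks on
them: the comma isn't in any of its character classes, so the match
stops at the first one and you get back only
"http://www.wired.com/news/medtech/0".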
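
To make the failure concrete, here is a minimal sketch that rebuilds the
quoted pattern and runs it against the URL above.  The function name
build_url_re and its extra_punc parameter are hypothetical, made up for
this illustration; they are not part of Daniel's or Christiansen's code.
Adding the comma to the punctuation class is only one possible fix, and
I haven't checked it against anything beyond the cases shown.

```python
import re

def build_url_re(extra_punc=""):
    """Rebuild the quoted URL pattern.

    extra_punc is a hypothetical knob: characters appended to the
    'punc' class (and therefore also to the 'any' class).
    """
    urls = "(%s)" % "|".join("http telnet gopher file wais ftp".split())
    ltrs = r"\w"
    gunk = r"/#~:.?+=&%@!\-"
    punc = r".:?\-" + extra_punc
    any_chars = ltrs + gunk + punc
    pattern = r"""
        \b                      # start at word boundary
        (                       # begin group 1
            %(urls)s :          # scheme and a colon
            [%(any)s] +?        # one or more URL characters, lazily
        )                       # end group 1
        (?=                     # look-ahead, consumes nothing
                [%(punc)s]*     # optional trailing punctuation
                [^%(any)s]      #  followed by a non-URL character
            |                   # or else
                $               #  the end of the string
        )
    """ % {"urls": urls, "any": any_chars, "punc": punc}
    return re.compile(pattern, re.VERBOSE)

text = "see http://www.wired.com/news/medtech/0,1286,50394,00.html for details"

# The pattern as quoted: the comma is not a URL character, so the
# lazy match ends just before the first one.
original = build_url_re()
truncated = original.search(text).group(0)
print(truncated)    # http://www.wired.com/news/medtech/0

# Assumed fix: treat the comma as URL punctuation.
patched = build_url_re(extra_punc=",")
full = patched.search(text).group(0)
print(full)         # http://www.wired.com/news/medtech/0,1286,50394,00.html

# A comma that merely follows a URL in prose is still left out,
# because the look-ahead skips trailing punctuation.
prose = patched.search("try http://x.com/a, then reload").group(0)
print(prose)        # http://x.com/a
```

Putting the comma in punc rather than gunk is deliberate: punc feeds
both the "any" class and the look-ahead, so commas inside a URL are
accepted while a comma that just happens to follow one is dropped.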

Perhaps the OP doesn't really need to validate the URLs.  He can just
spit them back out in the chat script and let the chatters complain to
one another if someone posts a bogus URL. :)


Tim



