Regex for URL extracting

Chris Mellon arkanes at gmail.com
Wed Jan 24 15:05:31 EST 2007


On 24 Jan 2007 11:07:49 -0800, Paul McGuire <ptmcg at austin.rr.com> wrote:
> On Jan 24, 10:20 am, "Johny" <pyt... at hope.cz> wrote:
> > Does anyone know about a good regular expression for URL extracting?
> >
> > J.
> Google turns this up:
>
> http://geekswithblogs.net/casualjim/archive/2005/12/01/61722.aspx
>
> But I've seen other re's for this problem that are hundreds of
> characters long.
>
> -- Paul

These are the regexps that gnome-terminal uses for its URL
auto-recognition, and I have shamelessly stolen them for use in one of
my own apps:

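import re

urlfinders = [
    # host (dotted IP, scheme://, or www./ftp. prefix), optional port, then a path
    re.compile("([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}|(((news|telnet|nttp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\\.)[-A-Za-z0-9\\.]+)(:[0-9]*)?/[-A-Za-z0-9_\\$\\.\\+\\!\\*\\(\\),;:@&=\\?/~\\#\\%]*[^]'\\.}>\\),\\\"]"),
    # host and optional port only, no path
    re.compile("([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}|(((news|telnet|nttp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\\.)[-A-Za-z0-9\\.]+)(:[0-9]*)?"),
    # local paths starting with ~/, / or ./ ; this string was wrapped in the mail,
    # rejoined here on the assumption that the break replaced a space ("\ ")
    re.compile("(~/|/|\\./)([-A-Za-z0-9_\\$\\.\\+\\!\\*\\(\\),;:@&=\\?/~\\#\\%]|\\\\ )+"),
    # e-mail addresses; \b stands in for the posted '\< , which Python's re
    # treats as a literal quote and "<" rather than a word boundary
    re.compile("\\b((mailto:)|)[-A-Za-z0-9\\.]+@[-A-Za-z0-9\\.]+"),
]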
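
For what it's worth, here is a rough sketch of how a list like this might be applied to a chunk of text. The helper name find_urls, the HOST/PATH split, and the sample string are all made up for illustration (they are not from gnome-terminal or from my app), and only the first two patterns are repeated so the snippet runs on its own. It tries the patterns in list order and skips any match that overlaps one already found, so the more specific host-plus-path pattern wins over the host-only one.

```python
import re

# The first two patterns from the list above, split into two pieces
# (HOST and PATH are illustrative names, not part of the original).
HOST = ("([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}"
        "|(((news|telnet|nttp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\\.)"
        "[-A-Za-z0-9\\.]+)(:[0-9]*)?")
PATH = "/[-A-Za-z0-9_\\$\\.\\+\\!\\*\\(\\),;:@&=\\?/~\\#\\%]*[^]'\\.}>\\),\\\"]"

urlfinders = [re.compile(HOST + PATH), re.compile(HOST)]


def find_urls(text):
    """Return matched substrings in order of appearance (hypothetical helper)."""
    found = []  # (start, end) spans accepted so far
    for finder in urlfinders:
        for match in finder.finditer(text):
            start, end = match.span()
            # Skip anything overlapping a span claimed by an earlier pattern.
            if not any(start < e and s < end for s, e in found):
                found.append((start, end))
    return [text[s:e] for s, e in sorted(found)]


sample = 'Docs: <http://www.python.org/doc/>, mirror at "www.example.com".'
print(find_urls(sample))
# ['http://www.python.org/doc/', 'www.example.com']
```

One thing to watch: the last character class in the first pattern only excludes a handful of punctuation marks, so when a URL with a path is followed by a space, the space ends up inside the match. Stripping whitespace from the results, or matching token by token, is probably a good idea.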


