How can I extract all URLs in a string using re.findall()?

could ildg could.net at gmail.com
Thu Apr 7 03:26:12 EDT 2005


That's it! Thank you~~
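
For anyone finding this thread later, here is a minimal sketch of both
suggestions from the quoted reply below. The sample string and the shortened
pattern are my own assumptions for illustration (a simplified stand-in, not
the full regex from my original post), and example.org is just a placeholder
host.

```python
import re

# Hypothetical sample text; the second URL is a made-up placeholder.
html = 'see http://image.zhongsou.com/image/netchina.gif and ftp://example.org/pub'

# Deliberately simplified patterns (assumption: not the original regex).
# Same structure twice: once with capturing groups, once with (?:...) only.
grouped = r"((http|ftp)://)([a-z]\w*(\.\w+)+)(/[\w.~]*)*"
non_grouping = r"(?:(?:http|ftp)://)(?:[a-z]\w*(?:\.\w+)+)(?:/[\w.~]*)*"

# With capturing groups, findall returns one tuple of group values per match.
tuples = re.findall(grouped, html)

# finditer yields match objects; group() with no argument is the whole match.
urls_via_finditer = [m.group() for m in re.finditer(grouped, html)]

# With only non-capturing groups, findall returns the whole matches as strings.
urls_via_findall = re.findall(non_grouping, html)

print(tuples[0])
print(urls_via_finditer)
print(urls_via_findall)
```

The first print shows the tuple-of-groups behaviour I was confused by; the
other two both give plain URL strings.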

On Apr 7, 2005 11:29 AM, Sidharth Kuruvila <sidharth.kuruvila at gmail.com> wrote:
> Reading the documentation on re might be helpful here :-P
> 
> When the pattern contains more than one group, findall returns a list
> with one tuple of all the groups for each match.
> 
> You might find finditer useful.
> 
> for m in re.finditer(url, html):
>     print(m.group())
> 
> Or you could replace all your parentheses with the non-grouping
> version. That is, replace every (...) with (?:...)
> 
> 
> On Apr 7, 2005 7:35 AM, could ildg <could.net at gmail.com> wrote:
> > I want to retrieve all URLs in a string. When I use re.findall, I get
> > a list of tuples.
> > My code is below:
> >
> > [code]
> > import re
> > url = unicode(r"((http|ftp)://)?(((([\d]+\.)+){3}[\d]+(/[\w./]+)?)|([a-z]\w*((\.\w+)+){2,})([/][\w.~]*)*)")
> > m = re.findall(url, html)
> > for i in m:
> >     print i
> > [/code]
> >
> > html is a string variable that contains many URLs.
> > The code prints many tuples, and each tuple does not seem to represent
> > a URL. For example, one of them is:
> >
> > (u'http://', u'http', u'image.zhongsou.com/image/netchina.gif', u'',
> > u'', u'', u'', u'image.zhongsou.com', u'.com', u'.com',
> > u'/netchina.gif')
> >
> > Why are there two "http" entries in it? And why are there so many empty
> > strings in the tuple above? It's obviously not a URL. How can I get the
> > URLs correctly?
> >
> > Thanks in advance.
> > --
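> > Parrots are extremely clever and utterly hilarious, and are good friends of mankind.
> > Until one day I realized that I am a parrot.
> > I am a parrot that climbs over the wall.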
> > --
> > http://mail.python.org/mailman/listinfo/python-list
> >
> 
> --
> http://blogs.applibase.net/sidharth
> 

