[Tutor] RE Silliness
Kent Johnson
kent37 at tds.net
Mon Jan 5 17:45:56 CET 2009
On Mon, Jan 5, 2009 at 11:16 AM, Omer <Jaggojaggo+Py at gmail.com> wrote:
> Bob, I tried your way.
>
>>>> import re
>>>> urlMask = r"http://[\w\Q./\?=\R]+(<br>)?"
>>>> text=u"Not working example<br>http://this.is.a/url?header=null<br>And another line<br>http://and.another.url"
>>>> re.findall(urlMask,text)
> [u'<br>', u'']
>
> spir, I did understand it. What I'm not understanding is why isn't this
> working.
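There is a bit of a gotcha in re.findall(): its behaviour changes
depending on whether the re contains groups. With no groups it returns
the full text of each match. With one group it returns only that
group's text for each match (which is why you got just the <br> or an
empty string), and with several groups it returns a tuple of the
groups' text for each match.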
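To make the three cases concrete, here is a minimal sketch. It assumes Python 3, and the sample string and variable names are made up for illustration; they are not from the original question.

```python
import re

# Made-up sample text, purely for illustration.
sample = "a1 b2 c3"

# No groups: findall() returns the full text of each match.
no_groups = re.findall(r"[a-z]\d", sample)

# One group: findall() returns only that group's text for each match.
one_group = re.findall(r"[a-z](\d)", sample)

# Two groups: findall() returns one tuple of group texts per match.
two_groups = re.findall(r"([a-z])(\d)", sample)

print(no_groups)   # ['a1', 'b2', 'c3']
print(one_group)   # ['1', '2', '3']
print(two_groups)  # [('a', '1'), ('b', '2'), ('c', '3')]
```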
If you enclose the entire re in parentheses (making it a group) you
get a better result:
In [2]: urlMask = r"(http://[\w\Q./\?=\R]+(<br>)?)"
In [3]: text=u"Not working example<br>http://this.is.a/url?header=null<br>And another line<br>http://and.another.url"
In [4]: re.findall(urlMask,text)
Out[4]:
[(u'http://this.is.a/url?header=null<br>', u'<br>'),
(u'http://and.another.url', u'')]
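If only the whole URLs are wanted from those tuples, the first element of each tuple can be picked out. The following is a rough sketch, assuming Python 3. The character class is simplified to `[\w./?=]` as an assumption on my part: recent Python 3 versions reject `\Q` and `\R` as bad escapes, and I believe the simplified class matches the same characters, since Q and R are already covered by `\w`.

```python
import re

# Simplified character class (an assumption, see the note above).
url_mask = r"(http://[\w./?=]+(<br>)?)"
text = ("Not working example<br>http://this.is.a/url?header=null<br>"
        "And another line<br>http://and.another.url")

# Each item is (whole match, inner <br> group); keep only the whole match.
pairs = re.findall(url_mask, text)
urls = [whole for whole, br in pairs]

print(urls)
# ['http://this.is.a/url?header=null<br>', 'http://and.another.url']
```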
You can also use non-capturing (non-grouping) parentheses, (?:...), around the <br>:
In [5]: urlMask = r"http://[\w\Q./\?=\R]+(?:<br>)?"
In [6]: re.findall(urlMask,text)
Out[6]: [u'http://this.is.a/url?header=null<br>', u'http://and.another.url']
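Another option, if you would rather leave the capturing group in place, is re.finditer(): it yields match objects, and group(0) is always the whole match regardless of how many groups the re contains. A small sketch, assuming Python 3 and a simplified character class `[\w./?=]` (my assumption, since recent Python 3 versions reject `\Q` and `\R` as bad escapes):

```python
import re

# The capturing group around <br> is kept on purpose.
url_mask = r"http://[\w./?=]+(<br>)?"
text = ("Not working example<br>http://this.is.a/url?header=null<br>"
        "And another line<br>http://and.another.url")

# group(0) is the full match, independent of any groups in the pattern.
urls = [m.group(0) for m in re.finditer(url_mask, text)]

print(urls)
# ['http://this.is.a/url?header=null<br>', 'http://and.another.url']
```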
Kent