[Tutor] RE Silliness

Mon Jan 5 17:45:56 CET 2009

On Mon, Jan 5, 2009 at 11:16 AM, Omer <Jaggojaggo+Py at gmail.com> wrote:
> Bob, I tried your way.
>
>>>> import re
>>>> urlMask = r"http://[\w\Q./\?=\R]+(<br>)?"
>>>> text=u"Not working example<br>http://this.is.a/url?header=null<br>And another line<br>http://and.another.url"
>>>> re.findall(urlMask,text)
> [u'<br>', u'']
>
> spir, I did understand it. What I don't understand is why this isn't
> working.

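There is a bit of a gotcha in re.findall(): its behaviour changes
depending on whether there are groups in the re. With no groups, it
returns the full text of each match. If the re contains groups,
re.findall() returns only what the groups matched. Your only group is
(<br>)?, so you get u'<br>' for the first URL and u'' for the second,
where the optional group matched nothing.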
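
A minimal sketch of the difference, using the same text. The mask here is
a simplified one of my own, without the \Q and \R from the original (as far
as I know these are not Python regex escapes, and newer Python versions
reject them); the names no_groups and one_group are just for illustration.

```python
import re

text = (u"Not working example<br>http://this.is.a/url?header=null<br>"
        u"And another line<br>http://and.another.url")

# No groups in the pattern: findall() returns the whole of each match.
no_groups = re.findall(r"http://[\w./?=]+", text)

# One group in the pattern: findall() returns only what the group matched,
# and an empty string where the optional group matched nothing.
one_group = re.findall(r"http://[\w./?=]+(<br>)?", text)

print(no_groups)
print(one_group)
```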

If you enclose the entire re in parentheses (making it a group) you
get a better result, one tuple per match with the whole match first:
In [2]: urlMask = r"(http://[\w\Q./\?=\R]+(<br>)?)"

In [3]: text=u"Not working example<br>http://this.is.a/url?header=null<br>And another line<br>http://and.another.url"

In [4]: re.findall(urlMask,text)
Out[4]:
[(u'http://this.is.a/url?header=null<br>', u'<br>'),
 (u'http://and.another.url', u'')]
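
If you only want the matched strings out of those tuples, you can take the
first item of each one. A small sketch, again with a simplified mask of my
own (no \Q or \R) and illustrative names:

```python
import re

text = (u"Not working example<br>http://this.is.a/url?header=null<br>"
        u"And another line<br>http://and.another.url")

# Group 1 is the whole match, group 2 is the optional <br>.
url_mask = r"(http://[\w./?=]+(<br>)?)"

# findall() returns one (group 1, group 2) tuple per match.
pairs = re.findall(url_mask, text)

# Keep only group 1 from each tuple.
urls = [whole for whole, br in pairs]
print(urls)
```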

You can also use non-capturing parentheses, (?:...), around the <br>:
In [5]: urlMask = r"http://[\w\Q./\?=\R]+(?:<br>)?"

In [6]: re.findall(urlMask,text)
Out[6]: [u'http://this.is.a/url?header=null<br>', u'http://and.another.url']
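
Another option, if you want to keep the capturing group, is re.finditer(),
which yields match objects instead of strings; group(0) is always the whole
match, whatever groups the re contains. This is only a sketch, with a
simplified mask of my own (no \Q or \R) and illustrative names:

```python
import re

text = (u"Not working example<br>http://this.is.a/url?header=null<br>"
        u"And another line<br>http://and.another.url")

# Same shape as the original mask, capturing group included.
url_mask = r"http://[\w./?=]+(<br>)?"

matches = list(re.finditer(url_mask, text))

# group(0) is the whole match, regardless of any groups.
whole = [m.group(0) for m in matches]

# group(1) is the <br> group; it is None where the group did not take part.
br = [m.group(1) for m in matches]

print(whole)
print(br)
```

Note that a match object gives None for a group that did not take part,
where findall() gives an empty string.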

Kent