[Tutor] RE Silliness

Kent Johnson kent37 at tds.net
Mon Jan 5 17:45:56 CET 2009


On Mon, Jan 5, 2009 at 11:16 AM, Omer <Jaggojaggo+Py at gmail.com> wrote:
> Bob, I tried your way.
>
>>>> import re
>>>> urlMask = r"http://[\w\Q./\?=\R]+(<br>)?"
>>>> text=u"Not working example<br>http://this.is.a/url?header=null<br>And
>>>> another line<br>http://and.another.url"
>>>> re.findall(urlMask,text)
> [u'<br>', u'']
>
> spir, I did understand it. What I'm not understanding is why isn't this
> working.

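There is a bit of a gotcha in re.findall(): its behaviour changes
depending on whether the re contains groups. If it does, re.findall()
returns only the text matched by the groups, not the whole match. With
one group you get a list of strings; with several, a list of tuples.
Your only group is (<br>)?, so you got u'<br>' for the first URL and
u'' for the second, where the optional group matched nothing.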
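
As a rough illustration of that rule, here is a minimal sketch on a
made-up string (the sample text and variable names are mine, not from
the original question), written for Python 3:

```python
import re

text = "a1 b2 c3"  # hypothetical sample data

# No groups: each item is the whole match.
no_groups = re.findall(r"[a-z]\d", text)

# One group: each item is only the text captured by that group.
one_group = re.findall(r"[a-z](\d)", text)

# Two groups: each item is a tuple with one entry per group.
two_groups = re.findall(r"([a-z])(\d)", text)

print(no_groups)   # ['a1', 'b2', 'c3']
print(one_group)   # ['1', '2', '3']
print(two_groups)  # [('a', '1'), ('b', '2'), ('c', '3')]
```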

If you enclose the entire re in parentheses (making it a group) you
get the full URL back, as the first element of each tuple:
In [2]: urlMask = r"(http://[\w\Q./\?=\R]+(<br>)?)"

In [3]: text=u"Not working example<br>http://this.is.a/url?header=null<br>And another line<br>http://and.another.url"

In [4]: re.findall(urlMask,text)
Out[4]:
[(u'http://this.is.a/url?header=null<br>', u'<br>'),
 (u'http://and.another.url', u'')]

You can also use non-capturing parentheses, (?:...), around the <br>,
so the re has no groups and re.findall() returns the whole match:
In [5]: urlMask = r"http://[\w\Q./\?=\R]+(?:<br>)?"

In [6]: re.findall(urlMask,text)
Out[6]: [u'http://this.is.a/url?header=null<br>', u'http://and.another.url']
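
Another option, if you would rather leave the capturing group in
place, is re.finditer(), which yields match objects whose group(0) is
always the whole match. This is only a sketch for Python 3, and it
assumes a simplified pattern: I have left \Q and \R out of the
character class, since Python's re does not recognise them as escapes
and newer Python 3 versions reject them.

```python
import re

# Simplified pattern (assumption: \Q and \R dropped from the class).
url_mask = r"http://[\w./?=]+(<br>)?"
text = ("Not working example<br>http://this.is.a/url?header=null<br>"
        "And another line<br>http://and.another.url")

# group(0) is the whole match, regardless of any groups in the re.
urls = [m.group(0) for m in re.finditer(url_mask, text)]
print(urls)
# ['http://this.is.a/url?header=null<br>', 'http://and.another.url']
```

With the same pattern, re.findall() would still return only the group
contents, ['<br>', ''].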

Kent

