[Tutor] re module / separator

Kent Johnson kent37 at tds.net
Wed Jun 24 23:22:03 CEST 2009


On Wed, Jun 24, 2009 at 2:24 PM, Tiago Saboga<tiagosaboga at gmail.com> wrote:
> Hi!
>
> I am trying to split some lists out of a single text file, and I am
> having a hard time. I have reduced the problem to the following one:
>
> text = "a2345b. f325. a45453b. a325643b. a435643b. g234324b."
>
> Of this line of text, I want to take out strings where all words start
> with a, end with "b.". But I don't want a list of words. I want that:
>
> ["a2345b.", "a45453b. a325643b. a435643b."]
>
> And I feel I still don't fully understand regular expression's logic. I
> do not understand the results below:
>
> In [33]: re.search("(a[^.]*?b\.\s?){2}", text).group(0)
> Out[33]: 'a45453b. a325643b. '

group(0) is the entire match so this returns what you expect. But what
is group(1)?

In [6]: re.search("(a[^.]*?b\.\s?){2}", text).group(1)
Out[6]: 'a325643b. '

Repeated groups are tricky; the returned value contains only the last
match for the group, not the earlier repeats.
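
A quick way to check which repetition a repeated group keeps is to
compare group(0) and group(1) side by side. This is only a sketch; the
names whole and last are mine, and I use raw strings because recent
Python versions warn about "\." in an ordinary string literal.

```python
import re

text = "a2345b. f325. a45453b. a325643b. a435643b. g234324b."

# The group is repeated exactly twice by {2}.
m = re.search(r"(a[^.]*?b\.\s?){2}", text)

whole = m.group(0)  # everything the pattern matched
last = m.group(1)   # only the final repetition of the group

print(repr(whole))  # 'a45453b. a325643b. '
print(repr(last))   # 'a325643b. '
```

Each repetition overwrites what the group captured before, so the first
word of the pair is only visible through group(0).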

> In [34]: re.findall("(a[^.]*?b\.\s?){2}", text)
> Out[34]: ['a325643b. ']

When the re contains groups, re.findall() returns the groups instead
of the whole match. So this is giving group(1), not group(0). You can
get the whole match by explicitly grouping it; with more than one
group, each result is a tuple of the groups:

In [4]: re.findall("((a[^.]*?b\.\s?){2})", text)
Out[4]: [('a45453b. a325643b. ', 'a325643b. ')]
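
Another option, if you would rather leave the pattern alone, is
re.finditer(), which yields match objects so you can ask for group(0)
yourself. A minimal sketch (the variable names are only illustrative):

```python
import re

text = "a2345b. f325. a45453b. a325643b. a435643b. g234324b."
pattern = r"(a[^.]*?b\.\s?){2}"

# findall() with one group returns that group for each match.
groups_only = re.findall(pattern, text)

# finditer() gives match objects, so the whole match is available.
whole_matches = [m.group(0) for m in re.finditer(pattern, text)]

print(groups_only)    # ['a325643b. ']
print(whole_matches)  # ['a45453b. a325643b. ']
```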

> In [35]: re.search("(a[^.]*?b\.\s?)+", text).group(0)
> Out[35]: 'a2345b. '

re.search() returns only the first match, so this is correct.

> In [36]: re.findall("(a[^.]*?b\.\s?)+", text)
> Out[36]: ['a2345b. ', 'a435643b. ']

This is finding both matches but the grouping has the same difficulty
as the previous findall(). This is closer:

In [7]: re.findall("((a[^.]*?b\.\s?)+)", text)
Out[7]: [('a2345b. ', 'a2345b. '), ('a45453b. a325643b. a435643b. ',
'a435643b. ')]

If you change the inner parentheses to be non-capturing, (?:...), then
you get pretty much what you want:

In [8]: re.findall("((?:a[^.]*?b\.\s?)+)", text)
Out[8]: ['a2345b. ', 'a45453b. a325643b. a435643b. ']
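
The only difference from the list you asked for is the trailing space
on each item, which comes from the \s? inside the repeated group. One
way to drop it, as a sketch, is to strip each result afterwards (the
name runs is just for illustration):

```python
import re

text = "a2345b. f325. a45453b. a325643b. a435643b. g234324b."

# Outer group captures the whole run; the inner (?:...) does not capture.
raw = re.findall(r"((?:a[^.]*?b\.\s?)+)", text)

# Remove the whitespace left at the end of each run by \s?.
runs = [s.strip() for s in raw]

print(runs)  # ['a2345b.', 'a45453b. a325643b. a435643b.']
```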

Kent

