Python re repetitive matching

Francis Avila francisgavila at yahoo.com
Mon Dec 22 20:38:15 EST 2003


Rich wrote in message ...
>Im new to regex's and cant quite figure out how to get them to work, what
>I want is a tuple of all the matches from the regex.  Ive simplified my
>actual problem and still cant get it to work

For the following answers I assume you only feed one line at a time.  (If
this is an unacceptable restriction, things get uglier.)

First, think about whether you need re's at all.  Re's are always a last
resort.  In this particular case, it seems to me that

s = "@5489 heel all and thumb toe"
s.split(' ', 1)

is all you need. If you need more precision (and the digit sequence is
always 4 chars long), the basic pattern is as follows:

re.split(r'(?<=@\d{4}) (?=.*)', s)
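As a quick check, here is what those two calls give on the sample line.
This is only a sketch; the names by_str and by_re are mine, chosen for
illustration.

```python
import re

s = "@5489 heel all and thumb toe"

# Plain string method: split once, on the first space.
by_str = s.split(' ', 1)

# Regex version: split on the single space that directly follows
# '@' plus exactly four digits (fixed-width lookbehind).
by_re = re.split(r'(?<=@\d{4}) (?=.*)', s)

print(by_str)  # ['@5489', 'heel all and thumb toe']
print(by_re)   # ['@5489', 'heel all and thumb toe']
```

Both give the same two-item list, which is why the string method is the
one I would reach for first.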

>Ive so far got this:
>print re.findall( r'(@\d+)|(\w+)', "@5489 heel all and thumb toe" )

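findall returns a tuple per match when the pattern contains capturing
groups, so you need nongrouping parens.  Also, \w+ will split the text into
separate words.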

Split into the number and the rest of the line, discarding nothing (the
second item keeps its leading space):
re.findall(r'(?:@\d{4})|(?:.+)', s)

Split each item separately, discarding whitespace:
re.findall(r'(?:@\d{4})|(?:\w+)', s)
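To see the difference the parens make, here is a small side-by-side
sketch of the capturing pattern from the question and the nongrouping
one.  The variable names are just for illustration.

```python
import re

s = "@5489 heel all and thumb toe"

# Capturing groups: findall returns one tuple per match, with an empty
# string for whichever alternative did not take part in that match.
with_groups = re.findall(r'(@\d+)|(\w+)', s)

# Nongrouping parens: findall returns the matched text itself.
without_groups = re.findall(r'(?:@\d{4})|(?:\w+)', s)

print(with_groups)
# [('@5489', ''), ('', 'heel'), ('', 'all'), ('', 'and'), ('', 'thumb'), ('', 'toe')]
print(without_groups)
# ['@5489', 'heel', 'all', 'and', 'thumb', 'toe']
```

Wrap the second result in tuple() if a tuple is really what you want.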

>I also tried my orginal idea
>
>a = re.match( r'(@\d+)\s+(\w+)', "@5489 heel all and thumb toe" )
>print a.groups()

re.match( r'(@\d+) (.+)', s ).groups()

>This matches the number and the first word, so I thought the following
>should rematch after the first word and give me what I wanted... but it
>dosent for some reason

It doesn't because '\w' means 'word characters', i.e. [a-zA-Z0-9_].  It
doesn't match spaces, so once it comes up against a space, it stops.

>
>a = re.match( r'(@\d+)\s+(?:(\w+)\s*)', "@5489 heel all and thumb toe" )
>print a.groups()

So you do know about nongrouping parens?  Anyway, this doesn't match past
the first word because the (?:...) part is applied only once: one word plus
any trailing whitespace, and then the pattern is finished.

>This is my next iteration, still gives me the number (first group) and
>only the word (the second match).  So I extend it to ...
>
>a = re.match( r'(@\d+)\s+(?:(\w+)\s*)*', "@5489 heel all and thumb toe" )
>print a.groups()
>
>Now this gives me the number and the last but one word ? WHY!

Because * does not magically make new groups.  A repeated group is still a
single group, and each repetition overwrites what the previous one
captured.  It seems to me it should match the last word, though, instead of
next-to-last, but I won't think about it too much because this re is hideous
as it is, and shouldn't be used.
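A short sketch of the point about repeated groups, plus one possible way
around it.  On a current CPython 3 I would expect the repeated group to
end up holding the last word; the match-then-split workaround and the
names m, m2 and items are my own illustration, not the only way to do it.

```python
import re

s = "@5489 heel all and thumb toe"

# The (\w+) group is reused on every pass through the *, so only the
# capture from the final repetition survives.
m = re.match(r'(@\d+)\s+(?:(\w+)\s*)*', s)
print(m.groups())  # ('@5489', 'toe')

# Workaround: capture the number and the rest of the line, then let
# split() break the rest into words.
m2 = re.match(r'(@\d+)\s+(.*)', s)
items = (m2.group(1),) + tuple(m2.group(2).split())
print(items)  # ('@5489', 'heel', 'all', 'and', 'thumb', 'toe')
```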

>My logic suggests that this should do what I want... what am I missing,
>Ive spent all night trying to figure this out.

Your first error was using regular expressions:

'Some people, when confronted with a problem, think "I know, I'll use
regular expressions".  Now they have two problems.'  --Jamie Zawinski,
    comp.lang.emacs

Use string methods, especially split().

Also, I am no longer sure whether you want every item as its own group, or
one group for the number and one for the rest of the words.  Either one is
trivial with string methods:

s.split() to get each item separately.
s.split(' ', 1) for only two groups.

However, the first one is impossible with RE groups in a single match (I
think) if the number of groups is variable, and ugly if the number of groups
is fixed.  The second one I've done ad nauseam here.
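If you also want to check that the line really starts with '@' and
digits, string methods can still do it.  The helper below is a
hypothetical sketch (parse_line is my name for it, and returning None
for a bad line is an assumption about what you would want):

```python
def parse_line(line):
    """Return the whitespace-separated items of line as a tuple, or None
    if the first item is not '@' followed by one or more digits."""
    parts = line.split()
    if not parts:
        return None
    head = parts[0]
    # ''.isdigit() is False, so a bare '@' is rejected too.
    if not (head.startswith('@') and head[1:].isdigit()):
        return None
    return tuple(parts)


print(parse_line("@5489 heel all and thumb toe"))
# ('@5489', 'heel', 'all', 'and', 'thumb', 'toe')
print(parse_line("heel all and thumb toe"))
# None
```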

See the RE Howto:
http://www.amk.ca/python/howto/regex/

Also, there's an O'Reilly book, "Mastering Regular Expressions", which is
said to be excellent.  David Mertz wrote "Text Processing in Python", which
is also said to be excellent, and he has a bunch of online columns on
Python, all of which are very good.  But my guess is that you don't really
need any of these.
--
Francis Avila




