newbie re question

Robin Munn rmunn at pobox.com
Thu Nov 7 10:53:19 EST 2002


On Thu, 07 Nov 2002 at 04:58 GMT, Bengt Richter <bokr at oz.net> wrote:
> On Wed, 06 Nov 2002 21:52:45 +0000, Gonçalo Rodrigues <op73418 at mail.telepac.pt> wrote:
> 
>>On Wed, 06 Nov 2002 21:17:12 GMT, "Fredrik Lundh"
>><fredrik at pythonware.com> wrote:
>>
>>>Gonçalo Rodrigues wrote:
>>>
>>>> I've been trying to grok re's and set myself a little exercise:
>>>> concoct a re for a Python identifier.
>>>>
>>>> Now what I got is
>>>>
>>>> >>> pattern = re.compile(r'(\s|^)([\w_][\w\._]*)(\s|$)')
>>>> >>> pattern.findall('aadf cdase b ad:aa aasa a.aa a@ aa _aa _aafr@ aa_aa aa__a?jk')
>>>> [('', 'aadf', ' '), (' ', 'b', ' '), (' ', 'aasa', ' '), (' ', 'aa', ' '), (' ', 'aa_aa', ' ')]
>>>>
>>>> But as you can see from the results, not all valid identifiers get
>>>> caught. For example, why isn't 'cdase' caught?
>>>
>>>findall returns non-overlapping matches.  there's only a single space
>>>between "aadf" and "cdase", and that was used by the first match.
>>
>>Typical newbie error - I forgot that it consumed the next char. And I do
>>want all the non-overlapping matches.
>>
>>>
>>>here's a better pattern:
>>>
>>>    pattern = re.compile(r'\b([a-zA-Z_]\w*)\b')
>>>
>>></F>
>>>
>>
>>OK, I revamped your pattern a little, but now I get too many matches,
>>e.g.
>>
>>>>> pattern = re.compile(r'\b([a-zA-Z_][a-zA-Z_\.]*)\b')
>>>>> pattern.findall('aadf cdase b ad:aa aasa a.aa a@ aa _aa _aafr@ aa_aa aa__a?jk')
>>['aadf', 'cdase', 'b', 'ad', 'aa', 'aasa', 'a.aa', 'a', 'aa', '_aa',
>>'_aafr', 'aa_aa', 'aa__a', 'jk']
>>
>>I want the re to reject 'ad:aa', 'a@', etc. which are not valid
>>identifiers. In the first case it returned 'ad' and 'aa' because \b
>>matched at ':' and I do *not* want that.
>>
>>I also tried using a not-match-lookahead as in
>>
>>>>> pattern = re.compile(r'([a-zA-Z_][a-zA-Z_\.]*)(?![^a-zA-Z_\.\s])')
>>>>> pattern.findall('aadf cdase b ad:aa aasa a.aa a@ aa _aa _aafr@ aa_aa aa__a?jk')
>>['aadf', 'cdase', 'b', 'a', 'aa', 'aasa', 'a.aa', 'aa', '_aa', '_aaf',
>>'aa_aa', 'aa__', 'jk']
>>
>>But as you see, it does not work either. For 'ad:aa' it returns the
>>matches 'a' and 'aa', and I understand why: it just keeps backtracking
>>until it gets a match.
>>
>>I'm beaten. Can any1 help me out here?
>>
> Does this do what you want? (I don't know what other characters you are concerned with
> besides [:@?])
> 
> >>> pattern = re.compile(r'\b((?<![:@?])[a-zA-Z_][a-zA-Z_\.]*(?![:@?]))\b')
> >>> pattern.findall('aadf cdase b ad:aa aasa a.aa a@ aa _aa _aafr@ aa_aa aa__a?jk')
>  ['aadf', 'cdase', 'b', 'aasa', 'a.aa', 'aa', '_aa', 'aa_aa']
> 
> Regards,
> Bengt Richter

This whole discussion is starting to remind me of a quote that I used to
have hanging on my cubicle wall:

    Some people, when confronted with a problem, think "I know, I'll use
    regular expressions." Now they have two problems.
                                                       - Jamie Zawinski

Are re's really what you need to use here? Correct me if I'm wrong, but
it looks like what you're trying to do is split the string into words
with a regexp, then apply rules about which characters may or may not be
part of the word. For the first part of that, why not use the string's
split() method? Then search the list of words with a much simpler regexp:

>>> import re
>>> pattern = re.compile(r'^[a-zA-Z_][a-zA-Z_\.]*$')
>>> list_to_search = 'aadf cdase b ad:aa aasa a.aa a@ aa _aa _aafr@ aa_aa aa__a?jk'.split()
>>> list_to_search
['aadf', 'cdase', 'b', 'ad:aa', 'aasa', 'a.aa', 'a@', 'aa', '_aa', '_aafr@', 'aa_aa', 'aa__a?jk']
>>> result_list = [item for item in list_to_search if pattern.match(item)]
>>> result_list
['aadf', 'cdase', 'b', 'aasa', 'a.aa', 'aa', '_aa', 'aa_aa']

(My apologies for the more-than-80-chars lines here).
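
The same split-then-match idea can be wrapped in a small function. This
is only a sketch: the names WORD_PATTERN and find_identifier_like are
made up for illustration, and the character class is the one used in
this thread (letters, underscores and dots, no digits), which is not
quite the real rule for Python identifiers.

```python
import re

# Same pattern as in the session above: a letter or underscore, followed
# by any number of letters, underscores or dots. Digits are deliberately
# left out because the patterns in this thread leave them out too.
WORD_PATTERN = re.compile(r'^[a-zA-Z_][a-zA-Z_.]*$')


def find_identifier_like(text):
    """Return the whitespace-separated words of text that match WORD_PATTERN.

    Hypothetical helper for illustration, not part of any library.
    """
    # split() with no arguments splits on any run of whitespace, so the
    # regexp never has to deal with the separators at all.
    return [word for word in text.split() if WORD_PATTERN.match(word)]


sample = 'aadf cdase b ad:aa aasa a.aa a@ aa _aa _aafr@ aa_aa aa__a?jk'
print(find_identifier_like(sample))
# ['aadf', 'cdase', 'b', 'aasa', 'a.aa', 'aa', '_aa', 'aa_aa']
```

Keeping the pattern anchored with ^ and $ means each word is accepted or
rejected as a whole, so there is no backtracking into partial words.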

My rule of thumb: if a regular expression takes more than ten seconds to
grok *in its entirety*, it's too complicated, and another solution
should be sought. Now sometimes there won't be an elegant solution and
you wind up having to use a complicated regexp, but usually just trying
to think about other solutions will help clarify the problem in your
mind.
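
For comparison, if a single pattern is still wanted, one possibility is
to assert the surrounding whitespace with lookarounds instead of
matching it, so that nothing between two words is consumed. This is a
sketch under that assumption, not something posted earlier in the
thread, and it reuses the same letters-underscores-dots character class.

```python
import re

# (?<!\S) succeeds when the previous character is not a non-space
#         character, i.e. at the start of the string or after whitespace.
# (?!\S)  succeeds when the next character is not a non-space character,
#         i.e. at the end of the string or before whitespace.
# Neither assertion consumes anything, so adjacent words can both match.
pattern = re.compile(r'(?<!\S)[a-zA-Z_][a-zA-Z_.]*(?!\S)')

sample = 'aadf cdase b ad:aa aasa a.aa a@ aa _aa _aafr@ aa_aa aa__a?jk'
matches = pattern.findall(sample)
print(matches)
# ['aadf', 'cdase', 'b', 'aasa', 'a.aa', 'aa', '_aa', 'aa_aa']
```

Whether that passes the ten-second test is a judgment call; the
split-based version is arguably still easier to read.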

BTW, the Jamie Zawinski quote above came from a Slashdot discussion:

    http://slashdot.org/articles/99/06/01/2122209.shtml

Scroll up about one or two pages from the bottom of the article to find
the post in which that quote appears.

-- 
Robin Munn <rmunn at pobox.com>
http://www.rmunn.com/
PGP key ID: 0x6AFB6838    50FF 2478 CFFB 081A 8338  54F7 845D ACFD 6AFB 6838


