newbie re question

Bengt Richter bokr at oz.net
Wed Nov 6 23:58:59 EST 2002


On Wed, 06 Nov 2002 21:52:45 +0000, Gonçalo Rodrigues <op73418 at mail.telepac.pt> wrote:

>On Wed, 06 Nov 2002 21:17:12 GMT, "Fredrik Lundh"
><fredrik at pythonware.com> wrote:
>
>>Gonçalo Rodrigues wrote:
>>
>>> I've been trying to grok re's and settled myself a little exercise:
>>> concoct a re for a Python identifier.
>>>
>>> Now what I got is
>>>
>>> >>> pattern = re.compile(r'(\s|^)([\w_][\w\._]*)(\s|$)')
>>> >>> pattern.findall('aadf cdase b ad:aa aasa a.aa a@ aa _aa _aafr@ aa_aa aa__a?jk')
>>> [('', 'aadf', ' '), (' ', 'b', ' '), (' ', 'aasa', ' '), (' ', 'aa', ' '), (' ', 'aa_aa', ' ')]
>>>
>>> But as you can see from the results, not all valid identifiers get
>>> caught. For example, why isn't 'cdase' caught?
>>
>>findall returns non-overlapping matches.  there's only a single space
>>between "aadf" and "cdase", and that was used by the first match.
>
>Typical newbie error - forgot that it consumed the next char. And I do
>want all the non overlapping matches.
>
>>
>>here's a better pattern:
>>
>>    pattern = re.compile(r'\b([a-zA-Z_]\w*)\b')
>>
>></F>
>>
>
>OK, I revamped your pattern a little, but now I get too many matches,
>e.g.
>
>>>> pattern = re.compile(r'\b([a-zA-Z_][a-zA-Z_\.]*)\b')
>>>> pattern.findall('aadf cdase b ad:aa aasa a.aa a@ aa _aa _aafr@ aa_aa aa__a?jk')
>['aadf', 'cdase', 'b', 'ad', 'aa', 'aasa', 'a.aa', 'a', 'aa', '_aa',
>'_aafr', 'aa_aa', 'aa__a', 'jk']
>
>I want the re to reject 'ad:aa', 'a@', etc. which are not valid
>identifiers. In the first case it returned 'ad' and 'aa' because \b
>matched on either side of the ':' and I do *not* want that.
>
>I also tried using a not-match-lookahead as in
>
>>>> pattern = re.compile(r'([a-zA-Z_][a-zA-Z_\.]*)(?![^a-zA-Z_\.\s])')
>>>> pattern.findall('aadf cdase b ad:aa aasa a.aa a@ aa _aa _aafr@ aa_aa aa__a?jk')
>['aadf', 'cdase', 'b', 'a', 'aa', 'aasa', 'a.aa', 'aa', '_aa', '_aaf',
>'aa_aa', 'aa__', 'jk']
>
>But as you see, it does not work either. For 'ad:aa' it returns the matches
>'a' and 'aa', and I understand why it does: it just keeps backtracking
>until it gets a match.
>
>I'm beaten. Can any1 help me out here?
>
Does this do what you want? (I don't know what other characters you are concerned with
besides [:@?])

 >>> pattern = re.compile(r'\b((?<![:@?])[a-zA-Z_][a-zA-Z_\.]*(?![:@?]))\b')
 >>> pattern.findall('aadf cdase b ad:aa aasa a.aa a@ aa _aa _aafr@ aa_aa aa__a?jk')
 ['aadf', 'cdase', 'b', 'aasa', 'a.aa', 'aa', '_aa', 'aa_aa']
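
If the real rule is "a token only counts when it is delimited by whitespace or the
ends of the string" (my assumption about the intent, not something stated above),
a sketch that avoids listing the rejected characters one by one is to use
lookarounds on \S instead. Lookarounds do not consume the neighbouring space, so
adjacent tokens such as 'aadf' and 'cdase' are both found. The name TOKEN is just
for illustration, and the character class is kept as in the pattern above (so
digits are still not accepted).

```python
import re

# Assumption: a valid token must be surrounded by whitespace or string ends.
# (?<!\S) succeeds only if the previous character is not a non-space character.
# (?!\S)  succeeds only if the next character is not a non-space character.
TOKEN = re.compile(r'(?<!\S)[a-zA-Z_][a-zA-Z_.]*(?!\S)')

text = 'aadf cdase b ad:aa aasa a.aa a@ aa _aa _aafr@ aa_aa aa__a?jk'
found = TOKEN.findall(text)
print(found)
# ['aadf', 'cdase', 'b', 'aasa', 'a.aa', 'aa', '_aa', 'aa_aa']
```

On this test string it gives the same list as the [:@?] version, but it would
also reject something like 'x#y' without the pattern having to mention '#'.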

Regards,
Bengt Richter


