newbie re question

Wed Nov 6 16:40:20 EST 2002

It seems ':' and '@' are treated as word breaks by \b.
Also, it seems the OP wants to capture 'a.aa' as a single match.
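
To make that concrete, here is a small sketch that runs the \b pattern from the quoted reply below over a few of the awkward tokens. The names word_boundary, sample and fragments are only mine, for illustration:

```python
import re

# Pattern from the quoted reply: \b matches at any boundary between
# a \w character and a non-\w character, not just at whitespace.
word_boundary = re.compile(r'\b([a-zA-Z_]\w*)\b')

sample = 'ad:aa a.aa a@ aa__a?jk'
fragments = word_boundary.findall(sample)
print(fragments)
# ['ad', 'aa', 'a', 'aa', 'a', 'aa__a', 'jk']
```

So 'ad:aa' and 'a.aa' come back in pieces, which does not look like what the OP was after. One more difference, for what it's worth: [a-zA-Z_] rejects a leading digit, while [\w_] accepts one, so a pattern starting with [\w_] would also return something like '1abc'.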

A lookahead does it, because it checks for the trailing whitespace without consuming it:

>>> import re
>>> pattern = re.compile(r'(?:\s|^)([\w_][\w\._]*)(?=\s|$)')
>>> pattern.findall('aadf cdase b ad:aa aasa a.aa a@ aa _aa _aafr@ aa_aa aa__a?jk xxx')
['aadf', 'cdase', 'b', 'aasa', 'a.aa', 'aa', '_aa', 'aa_aa', 'xxx']
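
A side-by-side sketch of why the lookahead matters, on a shorter string. The only difference between the two patterns is the last group: (?:\s|$) consumes the separator, (?=\s|$) only peeks at it. The variable names are just for illustration:

```python
import re

text = 'aadf cdase b'

# Trailing whitespace is consumed, so it cannot start the next match.
consuming = re.compile(r'(?:\s|^)([\w_][\w\._]*)(?:\s|$)')
# Trailing whitespace is only checked by the lookahead, not consumed,
# so the same space is still available as the leading \s of the next match.
lookahead = re.compile(r'(?:\s|^)([\w_][\w\._]*)(?=\s|$)')

consumed_result = consuming.findall(text)
lookahead_result = lookahead.findall(text)
print(consumed_result)   # ['aadf', 'b']
print(lookahead_result)  # ['aadf', 'cdase', 'b']
```

With the consuming version, the single space after 'aadf' is used up by the first match, so 'cdase' has nothing in front of it to match (?:\s|^).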


"Fredrik Lundh" <fredrik at pythonware.com> wrote in message
news:sFfy9.2868$h%5.101793 at newsb.telia.net...
> Gonçalo Rodrigues wrote:
>
> > I've been trying to grok re's and settled myself a little exercise:
> > concoct a re for a Python identifier.
> >
> > Now what I got is
> >
> > >>> pattern = re.compile(r'(\s|^)([\w_][\w\._]*)(\s|$)')
> > >>> pattern.findall('aadf cdase b ad:aa aasa a.aa a@ aa _aa _aafr@ aa_aa aa__a?jk')
> > [('', 'aadf', ' '), (' ', 'b', ' '), (' ', 'aasa', ' '), (' ', 'aa', ' '), (' ', 'aa_aa', ' ')]
> >
> > But as you can see from the results, not all valid identifiers get
> > caught. For example, why isn't 'cdase' caught?
>
> findall returns non-overlapping matches.  there's only a single space
> between "aadf" and "cdase", and that was used by the first match.
>
> here's a better pattern:
>
>     pattern = re.compile(r'\b([a-zA-Z_]\w*)\b')
>
> </F>
>
>