Python regular expression question!

Ant antroy at gmail.com
Wed Sep 20 15:36:56 EDT 2006


unexpected wrote:
> > \b matches the beginning/end of a word (characters a-zA-Z_0-9).
> > So that regex will match e.g. MULTX-FOO but not MULTX-.
> >
>
> So is there a way to get \b to include - ?

No, but you can get the behaviour you want using a negative lookahead.
The following regex behaves like \b at the end of a word, with - treated
as a word character:

pattern = r"(?![a-zA-Z0-9_-])"

This succeeds at any position where the next character isn't in the
class [a-zA-Z0-9_-] (including the end of the string), and it doesn't
consume that character. For example:

>>> import re
>>> p = re.compile(r".*?(?![a-zA-Z0-9_-])(.*)")
>>> s = "aabbcc_d-f-.XXX YYY"
>>> m = p.search(s)
>>> print(m.group(1))
.XXX YYY

Note that the regex recognises the '.' as the end of the word, but
doesn't use it up in the match, so it is present in the final capturing
group. Contrast it with:

>>> p = re.compile(r".*?[^a-zA-Z0-9_-](.*)")
>>> s = "aabbcc_d-f-.XXX YYY"
>>> m = p.search(s)
>>> print(m.group(1))
XXX YYY

Note here that "[^a-zA-Z0-9_-]" still denotes the end of the word, but
this time consumes it, so it doesn't appear in the final captured group.
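
The lookahead on its own only checks what comes after the current
position. If you want something closer to \b on both sides, one option
is to pair it with a negative lookbehind. This is only a rough sketch,
assuming - should count as a word character everywhere; the names WORD,
START, END and find_words are made up for illustration and are not part
of the re module:

```python
import re

# Characters treated as part of a word. Including '-' is an assumption
# taken from the question; adjust the class as needed.
WORD = r"[a-zA-Z0-9_-]"

# Hypothetical helper patterns (not provided by the re module).
# START: no word character before this position, and one right after it.
START = r"(?<!%s)(?=%s)" % (WORD, WORD)
# END: a word character right before this position, and none after it.
END = r"(?<=%s)(?!%s)" % (WORD, WORD)


def find_words(text):
    """Return every maximal run of WORD characters in text."""
    return re.findall(START + WORD + "+" + END, text)


print(find_words("aabbcc_d-f-.XXX YYY"))
# ['aabbcc_d-f-', 'XXX', 'YYY']

# A literal ending in '-' now matches only when it stands alone.
print(re.search(START + "MULTX-" + END, "MULTX- ") is not None)    # True
print(re.search(START + "MULTX-" + END, "MULTX-FOO") is not None)  # False
```

Both assertions are zero-width, so nothing around the word is consumed.
The lookbehind works here because the class always matches exactly one
character, which is what Python's re module requires of a lookbehind.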



