[Tutor] Why doesn't this regex match???

Sat, 09 Feb 2002 03:06:17 -0500

[Sheila King]
> Despite Tim's excellent advice, to reconsider whether I really want to
> use regular expressions, I persist at that task. (I've been programming
> long enough to feel that I ought to have learned them by now. And I'd
> like to start understanding them so when other people discuss them I
> don't feel lost.)

The illusion you're suffering is that when other people discuss them,
they're not also lost <wink>.  O'Reilly publishes an excellent book titled
"Mastering Regular Expressions"; it's the only really good intro I've ever
seen.

> So, I decided I could possibly solve my problem with the whole "word
> boundary" thing (when my phrase-pattern doesn't begin and end in an
> alphanumeric character) as follows:
>
> Match on any of the following at the beginning of the phrase-pattern:
>
> The beginning of the searched string
> A word boundary
> White space
>
> and any of the following at the end of the phrase-pattern:
>
> The end of the searched string
> A word boundary
> White space
>
> Seems to me, that this should about cover things???

That's part of the problem:  defining what you want to match, exactly.  The
other part is spelling that with regexps.

> So I tried the following, with the dismal results shown. Now what am I
> doing wrong?

As before, the best way to proceed is to simplify the regexp until you stop
having problems, and then make it more complicated again one step at a time.
In fact you've got several problems in this attempt, and it's much harder
to wrap your brain around all of them in one gulp.

> >>> searchstring = 'ADV: FREE FREE OFFERZ!!!!'
> >>> word = 'adv:'
> >>> p = re.compile(r'[\b\A\s]%s[\b\Z\s]' % word, re.I)
> Traceback (most recent call last):
>   File "<pyshell#45>", line 1, in ?
>     p = re.compile(r'[\b\A\s]%s[\b\Z\s]' % word, re.I)
>   File "E:\PYTHON\PYTHON22\lib\sre.py", line 178, in compile
>     return _compile(pattern, flags)
>   File "E:\PYTHON\PYTHON22\lib\sre.py", line 228, in _compile
>     raise error, v # invalid expression
> error: internal: unsupported set operator

One problem that isn't biting you (yet):  \b inside a character class
doesn't mean word-boundary, it means the ASCII backspace character (chr(8)).
What you feared is true:  every word of the regexp docs is there for a
reason <0.5 wink>.
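
If you want to see that for yourself, here's a small sketch (the test
strings other than your searchstring are made up for illustration):

```python
import re

# Inside a character class, \b is the backspace character, chr(8).
in_class = re.compile(r'[\b]')
print(in_class.search('a b'))        # None: no backspace in 'a b'
print(in_class.search('a\x08b'))     # matches the literal backspace

# Outside a character class, \b is a zero-width word-boundary assertion.
boundary = re.compile(r'\bfree\b', re.IGNORECASE)
m = boundary.search('ADV: FREE FREE OFFERZ!!!!')
print(m.span())                      # position of the first FREE
```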

A second problem that isn't biting you (yet):  note that when I rewrote your
earlier pattern, I used re.escape to transform the word you inserted into
the regexp.  Else "funny characters" in the word will be taken as
instructions to the regexp engine, not as characters to be matched
literally.  re.escape(word) inserts backslashes as needed, so that every
character in word gets matched literally.
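
A quick way to convince yourself, using a made-up word ('a.c') that isn't
from your example but contains a regexp metacharacter:

```python
import re

word = 'a.c'   # hypothetical word; '.' normally means "any character"

unescaped = re.compile(word)
escaped = re.compile(re.escape(word))

# Without escaping, the '.' matches any character, so 'abc' matches too.
print(unescaped.search('abc') is not None)

# With re.escape, only a literal 'a.c' matches.
print(escaped.search('abc') is None)
print(escaped.search('a.c') is not None)
```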

The last problem is that \A and \Z aren't characters at all, they're
"zero-width assertions".  A character class always matches exactly one
character, while an assertion matches a position and consumes nothing, so
they don't make sense inside one.  It would be nice if the regexp compiler
gave you a reasonable message about that instead of blowing up with an
internal error (you should report that bug on SourceForge!).  You'll have
to do it like this instead:

p = re.compile(r'(\A|\b|\s)%s(\Z|\b|\s)' % re.escape(word), re.IGNORECASE)
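
Here's a sketch of that pattern run against your searchstring.  The
phrase_pattern helper and the second test string are made up for
illustration; they're not anything you wrote.  Note that when the \s
alternative is the one that succeeds, the whitespace becomes part of the
match:

```python
import re

def phrase_pattern(word):
    # Hypothetical helper: wraps the pattern above around an escaped word.
    return re.compile(r'(\A|\b|\s)%s(\Z|\b|\s)' % re.escape(word),
                      re.IGNORECASE)

p = phrase_pattern('adv:')

# \A succeeds at the front; at the back neither \Z nor \b applies
# (':' and ' ' are both non-word characters), so \s consumes the space.
m = p.search('ADV: FREE FREE OFFERZ!!!!')
print(repr(m.group(0)))

# 'adv:' buried inside a longer word isn't at the start of the string,
# at a word boundary, or after whitespace, so nothing matches.
print(p.search('BADV: no thanks'))
```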