problem with regex

Fri Nov 22 05:59:43 EST 2002

x-herbert wrote:

> I want to find, e.g., only the word eMail in a string. Here is a part of my
> code... .....................
> import re
> key = "eMail"
> i = ".....bla bla XXXX ..."  #see below for XXXX...
> regkey = "\\b"+key+"\\b"     # match key only as a whole word
> regex = re.compile(regkey)
> result = regex.search(i)
> if result:
>     print("find")
> else:
>     print("find not")
> .......................
> Result:
> XXXX:       print:
> eMail       find   # o.k. ;-)
> eMailNews   find not
> eMail_News  find not
> eMail-News  find # upps why!!!!!!!!!!!!!!!!
> eMail*News  find # upps why!!!!!!!!!!!!!!!!
> eMail?News  find # upps why!!!!!!!!!!!!!!!!
> eMail#News  find # upps why!!!!!!!!!!!!!!!!
> ...etc.
> 
> I thought regkey = "\beMail\b" would find this word only when it stands alone.... ?????

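It finds this word as long as it's not part of some OTHER word, i.e., as
long as it has word boundaries on each side.  And that is exactly what
is happening in the examples you relate: punctuation DOES give a word
boundary, and hyphens, question marks, etc., are punctuation.  The
underscore, by contrast, counts as a word character (\w matches letters,
digits and the underscore), which is why eMail_News is not found.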
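
To make the boundary rule concrete, here is a small sketch that runs
your pattern over the sample strings from your table.  The use of
re.escape and the names samples/results are my own additions for
illustration, not part of your code.

```python
import re

key = "eMail"
# re.escape is a defensive assumption: it only matters if key ever
# contains characters that are special in a regular expression.
pattern = re.compile(r"\b" + re.escape(key) + r"\b")

samples = ["eMail", "eMailNews", "eMail_News", "eMail-News",
           "eMail*News", "eMail?News", "eMail#News"]

# \b sits between a word character (letter, digit, underscore) and
# anything else, so only a following letter, digit or underscore
# prevents the match.
results = {s: bool(pattern.search(s)) for s in samples}

for s, found in results.items():
    print(s, "find" if found else "find not")
```

Only eMailNews and eMail_News should come out as "find not", matching
the table you posted.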

If what you want is just to check whether string i is equal to string
key, then "if i==key:" is by far the fastest.  If for some weird reason
you MUST perform this task with a RE, you can use r'\AeMail\Z' (or
'^eMail$' compiled with the re.MULTILINE flag, if you want to match at
the start/end of any line too, not JUST of the entire string).  You
don't need the starting anchor (r'\A' or '^') if you use the match
method rather than the search method.
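
A minimal sketch of these alternatives side by side, in case it helps.
The helper name is_key and the sample text are hypothetical, chosen
just for this illustration.

```python
import re

key = "eMail"

# Plain equality: no RE needed at all.
def is_key(s):
    return s == key

# \A and \Z anchor to the start and end of the whole string.
whole_string = re.compile(r"\A" + re.escape(key) + r"\Z")

# match() only tries at position 0, so the leading anchor can go.
end_anchored = re.compile(re.escape(key) + r"\Z")

# ^ and $ match at every line start/end only under re.MULTILINE.
per_line = re.compile("^" + re.escape(key) + "$", re.MULTILINE)

text = "first line\neMail\nlast line"

print(is_key("eMail"), is_key("eMail-News"))
print(bool(whole_string.search("eMail")), bool(whole_string.search(text)))
print(bool(end_anchored.match("eMail")), bool(end_anchored.match("xeMail")))
print(bool(per_line.search(text)))
```

The whole-string pattern rejects the three-line text, while the
per-line pattern accepts it because eMail fills one line by itself.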

If what you actually want is something different again, such as
deciding that for your purposes question marks are not punctuation
but some other characters are, you can probably do it with more
advanced tools such as lookahead and lookbehind, but before we
delve into such complications it MIGHT be better for you to
clarify exactly what you need (and if it's just equality, why
you need it in a RE rather than as plain ==).
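
Just to give the flavor of the lookahead/lookbehind route, here is a
rough sketch under one assumed requirement: that the question mark
should count as part of a word while other punctuation still separates
words.  That requirement, and the names extra_word_chars and found, are
my guesses for illustration only; your real rules may well differ.

```python
import re

key = "eMail"

# Assumption: characters that should NOT act as word separators,
# in addition to the usual \w set (letters, digits, underscore).
extra_word_chars = "?"

# Character class of everything treated as "inside a word".
word_class = r"[\w" + re.escape(extra_word_chars) + "]"

# Negative lookbehind and lookahead: key must not be preceded or
# followed by any character from word_class.
pattern = re.compile(
    "(?<!" + word_class + ")" + re.escape(key) + "(?!" + word_class + ")"
)

def found(s):
    return bool(pattern.search(s))

for s in ["eMail", "eMail-News", "eMail?News", "eMailNews"]:
    print(s, "find" if found(s) else "find not")
```

With this pattern eMail-News is still found but eMail?News is not,
since the question mark is now treated like a letter.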

Alex