[Repost] Re: [Tutor] newbie re question

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Thu Jul 10 16:57:01 2003


On Thu, 10 Jul 2003 tpc@csua.berkeley.edu wrote:

> hi Danny, I sent this to the list two days ago and to you yesterday and
> wasn't sure if you received it.


Hi tpc,


I'm still trying to catch up with email; my apologies for not responding
to this sooner.



> I am still confused why you provided for a negative lookahead.  I looked
> at amk's definition of a negative lookahead, and it seems to say the
> regex will not match if the negative lookahead condition is met.  So:
>
> >>> testsearch = re.compile('tetsuro(?!hello)', re.IGNORECASE)
> >>> testsearch.search('tetsurohello')
> >>> testsearch.search('hitetsuroone')
> <_sre.SRE_Match object at 0x860e4a0>


The negative lookahead in the URL detection routines allows us to find URLs
at the end of sentences.  For example, the string:


    """You can find out more information about Python
       at http://python.org/doc/Newbies.html.  It's very helpful."""

contains an embedded URL that can be tricky to pull out correctly.
Normally, periods are legal "word" characters in that regular expression,
and a regular expression without negative lookahead will probably grab
"http://python.org/doc/Newbies.html.", including the last period.



Looking back on the regular expression:

    (http://
          [\w\.-]+            ## bunch of "word" characters
    )

          \.?
          (?![\w.-/])


the negative lookahead is there to detect a period that's being used to end
a sentence, and allows us to exclude it so that we can correctly extract
"http://python.org/doc/Newbies.html" without the trailing period.

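As a rough, runnable sketch of that idea: the pattern below is a variant of
the one above, not the exact expression from the earlier message.  The
non-greedy '+?' and the '/' inside the character class are assumptions made
here so that the example sentence works, and the names NAIVE_URL and
CAREFUL_URL are made up for illustration.

```python
import re

text = ("You can find out more information about Python "
        "at http://python.org/doc/Newbies.html.  It's very helpful.")

# No lookahead: periods are legal URL characters here, so the
# sentence-ending period gets swallowed along with the URL.
NAIVE_URL = re.compile(r'http://[\w./-]+')

# Variant with negative lookahead (an assumption, see above): the group
# grows one character at a time until an optional period is followed by
# something that cannot be part of a URL, such as a space.
CAREFUL_URL = re.compile(r'(http://[\w./-]+?)\.?(?![\w./-])')

naive = NAIVE_URL.search(text).group()
careful = CAREFUL_URL.search(text).group(1)

print(naive)    # http://python.org/doc/Newbies.html.
print(careful)  # http://python.org/doc/Newbies.html
```

The first group of CAREFUL_URL holds the URL itself; the optional period
sits outside the group, so it never ends up in the extracted text.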




> Here is an example of something similar that perplexes:
>
> >>> testsearch = re.compile(r'tetsuro\.?(?!hello)', re.IGNORECASE)
> >>> match = testsearch.search('tetsuro.hello')
> >>> match.group()
> 'tetsuro'


In this first case, we allow the period to be optional: that's the key
that allows this to match.  The regular expression, then, eats its input
up to 'tetsuro' and lets '\.?' match nothing.  The negative lookahead is
happy, since the remaining input, '.hello', does not begin with 'hello'.
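
One way to check where the match stops is to ask the match object for its
end position and look at what comes after it.  A small sketch (the variable
names 'matched' and 'rest' are just for illustration):

```python
import re

testsearch = re.compile(r'tetsuro\.?(?!hello)', re.IGNORECASE)
text = 'tetsuro.hello'
m = testsearch.search(text)

matched = m.group()      # the text the pattern consumed
rest = text[m.end():]    # what is left over; the lookahead was tested here

print(repr(matched), repr(rest))   # 'tetsuro' '.hello'
```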



If we remove '?' from the query, you'll see what you probably expect:

###
>>> import re
>>> testsearch = re.compile(r'tetsuro\.(?!hello)', re.IGNORECASE)
>>> match = testsearch.search('tetsuro.hello')
>>> match.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: 'NoneType' object has no attribute 'group'
###


And now it doesn't match at all.




> >>> match = testsearch.search('tetsuro..hello')
> >>> match.group()
> 'tetsuro.'
>
> Why wasn't the first period caught ?



Going back to the example,

###
testsearch = re.compile(r'tetsuro\.?(?!hello)', re.IGNORECASE)
match = testsearch.search('tetsuro.hello')
###

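If '\.?' catches the period, then the negative lookahead constraint gets
violated, since the rest of the input, 'hello', is exactly what (?!hello)
forbids.  The regular expression engine is forced to back up and not match
that period, which is why you get 'tetsuro'.  In 'tetsuro..hello', the first
period can be caught, because what follows it is '.hello', not 'hello'.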
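
To make the backtracking a bit more concrete, here is a hand-written replay
of the two choices the engine has for '\.?'.  This is only a sketch: the
helper by_hand() is hypothetical, it assumes the text starts with 'tetsuro',
and it leaves out re.IGNORECASE to keep things short.

```python
import re

PATTERN = re.compile(r'tetsuro\.?(?!hello)')

def by_hand(text):
    # Hypothetical helper: replay the engine's choices for PATTERN,
    # assuming text starts with 'tetsuro'.
    pos = len('tetsuro')
    # First choice: let the optional period consume a '.' if one is there,
    # as long as 'hello' does not come right after it.
    if text[pos:pos + 1] == '.' and not text.startswith('hello', pos + 1):
        return text[:pos + 1]
    # Backtrack: let the optional period match nothing instead.
    if not text.startswith('hello', pos):
        return text[:pos]
    return None  # both choices violate the lookahead

for sample in ['tetsuro.hello', 'tetsuro..hello', 'tetsurohello']:
    m = PATTERN.match(sample)
    print(sample, '->', by_hand(sample), '/', m.group() if m else None)
```

For each sample, the hand-written version and the real pattern should agree:
'tetsuro', then 'tetsuro.', then no match at all.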


I have to admit: I'm not a regular expression guru.  *grin*  I have heard,
though, that the book "Mastering Regular Expressions",

    http://www.oreilly.com/catalog/regex/

is a good one to read on this subject.


Good luck!