regex confusion

Tue Dec 9 12:08:50 EST 2003

John Hunter wrote: 
> In trying to debug why a certain regex wasn't working like I expected
> it to, I came across this strange (to me) behavior.  The file I am
> trying to match definitely contains many instances of the letter 'a',
> so I would expect the regex
> 
>   rgxPrev = re.compile('.*?a.*?')
> 
> to match the string contents of the file.  But it doesn't.  Here is
> a complete example
> 
>     import re, urllib
>     rgxPrev = re.compile('.*?a.*?')
> 
>     url = 'http://nitace.bsd.uchicago.edu:8080/files/share/showdown_example2.html'
>     s = urllib.urlopen(url).read()
>     m =  rgxPrev.match(s)
>     print m
>     print s.find('a')
> 
> m is None (no match) and the s.find('a') reports an 'a' at index 48.

By default, the dot '.' matches anything *but a newline*, and match()
only tries the pattern at the start of the string.  So if there is no
'a' in the first line of the string, the match will fail.
If you want the dot '.' to match newlines as well, you can define your
regex as:

rgxPrev = re.compile('.*?a.*?', re.DOTALL)
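
As a small sketch of the difference (the sample string below is made up
and only stands in for your downloaded page; its first line contains
no 'a'):

```python
import re

# Hypothetical sample text: no 'a' on the first line, one on the second.
s = "first line\nsecond line has an a\n"

plain = re.compile('.*?a.*?')
dotall = re.compile('.*?a.*?', re.DOTALL)

# Without DOTALL the dot cannot cross the first newline, so match(),
# which is anchored at the start of the string, fails.
m_plain = plain.match(s)

# With DOTALL the dot also matches '\n', so the match runs up to the
# first 'a' in the string.
m_dotall = dotall.match(s)

print(m_plain)
print(m_dotall.end() == s.find('a') + 1)
```

Here m_plain should be None, while m_dotall ends just after the first
'a' that s.find('a') reports.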

> I read the regex to mean non-greedy match of anything up to an a,
> followed by non-greedy match of anything following an a, which this
> file should match.
> 
Also note that the last '.*?' is completely superfluous unless it is 
followed by something else.  Being non-greedy means that it will
match *as little as possible* while still letting the whole pattern
match, and being at the end means that it can match the empty string
and still satisfy the whole pattern.
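
To illustrate with a made-up string (not your file), the pattern with
and without the trailing '.*?' should give the same match, and the
trailing part only does something once another token follows it:

```python
import re

# Hypothetical sample string, chosen only for illustration.
s = "xxaYYY"

with_tail = re.match('.*?a.*?', s)
without_tail = re.match('.*?a', s)

# The trailing non-greedy '.*?' matches the empty string, so both
# patterns stop right after the first 'a'.
print(repr(with_tail.group()))
print(repr(without_tail.group()))

# Once something follows it, the '.*?' consumes only as much as is
# needed to reach that something (here, the first 'Y').
followed = re.match('.*?a.*?Y', s)
print(repr(followed.group()))
```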

Depending on what you are up to, it would also be wise to 
consider whether you can do it with string methods.  They
are considerably easier to handle *and debug*.
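
For the simple "does it contain an 'a'" case, a rough sketch with
string methods might look like this (again on a made-up sample string;
I am only guessing at what you want to do with the text around the 'a'):

```python
# Hypothetical sample text standing in for the file contents.
s = "first line\nsecond line has an a\n"

# Membership test: is there an 'a' anywhere, newlines or not?
has_a = 'a' in s

# Index of the first 'a', or -1 if there is none.
idx = s.find('a')

# Split around the first 'a' by slicing at that index.
before = s[:idx]
after = s[idx + 1:]

print(has_a)
print(idx)
print(repr(before))
print(repr(after))
```

None of this cares about newlines, so there is no flag to forget.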

If you need to do extensive text processing in Python, there is
a nice book by David Mertz, which discusses the issue (from
simple string methods up to regexes and parsers) at:

http://gnosis.cx/TPiP/

which might also be of help.

Xavier Martinez