regex confusion

Harvey Thomas hst at empolis.co.uk
Tue Dec 9 11:17:37 EST 2003


John Hunter wrote:
> In trying to debug why a certain regex wasn't working like I expected
> it to, I came across this strange (to me) behavior.  The file I am
> trying to match definitely contains many instances of the letter 'a',
> so I would expect the regex
> 
>   rgxPrev = re.compile('.*?a.*?')
> 
> to match the string contents of the file.  But it doesn't.  Here is
> a complete example
> 
>     import re, urllib
>     rgxPrev = re.compile('.*?a.*?')
> 
>     url = 'http://nitace.bsd.uchicago.edu:8080/files/share/showdown_example2.html'
>     s = urllib.urlopen(url).read()
>     m =  rgxPrev.match(s)
>     print m
>     print s.find('a')
> 
> m is None (no match) and the s.find('a') reports an 'a' at index 48.
> 
> I read the regex to mean non-greedy match of anything up to an a,
> followed by non-greedy match of anything following an a, which this
> file should match.
> 
> Or am I insane?
> 
> John Hunter
> 

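You need

rgxPrev = re.compile('.*?a.*?', re.DOTALL)

to get newline characters matched by ".". Your URL content starts with a
newline, and match() only tries the pattern at the start of the string.
Without re.DOTALL the leading ".*?" cannot get past that newline to reach
the first 'a', so the match fails.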
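
To see the difference without fetching the page, here is a minimal sketch. The string s is a made-up stand-in for the file contents (I am only assuming it begins with a newline and has an 'a' further on), and the variable names other than the pattern itself are mine. It also shows that search(), which may start matching at any position, finds a match even without the flag.

```python
import re

# Hypothetical stand-in for the downloaded page: a leading newline,
# then some text that contains the letter 'a'.
s = "\n<html>\n<head><title>showdown example</title></head>\n"

plain = re.compile('.*?a.*?')
dotall = re.compile('.*?a.*?', re.DOTALL)

# match() anchors at index 0; "." cannot consume the leading "\n",
# so the pattern never reaches an 'a'.
m_plain = plain.match(s)

# With re.DOTALL, "." also matches newlines, so the lazy ".*?" can
# walk across them up to the first 'a'.
m_dotall = dotall.match(s)

# search() is free to start after the newlines, so it succeeds
# even without re.DOTALL.
m_search = plain.search(s)

print(m_plain)
print(repr(m_dotall.group(0)) if m_dotall else None)
print(repr(m_search.group(0)) if m_search else None)
```

Note that because both quantifiers are non-greedy, the match stops right after the first 'a'; the trailing ".*?" matches nothing.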

HTH

Harvey
