regex confusion
Harvey Thomas
hst at empolis.co.uk
Tue Dec 9 11:17:37 EST 2003
John Hunter wrote
> In trying to debug why a certain regex wasn't working like I expected
> it to, I came across this strange (to me) behavior. The file I am
> trying to match definitely contains many instances of the letter 'a',
> so I would expect the regex
>
> rgxPrev = re.compile('.*?a.*?')
>
> to match the string contents of the file. But it doesn't. Here is
> a complete example
>
> import re, urllib
> rgxPrev = re.compile('.*?a.*?')
>
> url = 'http://nitace.bsd.uchicago.edu:8080/files/share/showdown_example2.html'
> s = urllib.urlopen(url).read()
> m = rgxPrev.match(s)
> print m
> print s.find('a')
>
> m is None (no match) and the s.find('a') reports an 'a' at index 48.
>
> I read the regex to mean non-greedy match of anything up to an a,
> followed by non-greedy match of anything following an a, which this
> file should match.
>
> Or am I insane?
>
> John Hunter
>
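You need
rgxPrev = re.compile('.*?a.*?', re.DOTALL)
to get newline characters matched by ".". Your URL content starts with a
newline, and match() only tries the pattern at the start of the string, so
without re.DOTALL the leading ".*?" cannot get past that first character.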
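A small sketch of the difference, in case it helps. The string s below is a made-up stand-in for the page contents (the only property assumed from the real file is that it starts with a newline), and it is written for Python 3, so it does not fetch the URL:

```python
import re

# Hypothetical stand-in for the downloaded page; like the real file,
# it begins with a newline character.
s = "\n<html>\n<head><title>example</title></head>\n"

plain = re.compile('.*?a.*?')
dotall = re.compile('.*?a.*?', re.DOTALL)

# match() is anchored at index 0, and "." will not match "\n" by default,
# so the leading ".*?" can never reach the first 'a'.
m_plain = plain.match(s)

# With re.DOTALL, "." also matches "\n", so the pattern can span lines.
m_dotall = dotall.match(s)

# search() is not anchored at index 0; it retries from later positions
# and succeeds on the first line that contains an 'a'.
m_search = plain.search(s)

print(m_plain)            # None
print(m_dotall.span())    # (0, 12)
print(m_search.group())   # <hea
```

If the goal is only to find an 'a' anywhere in the text, search() without re.DOTALL may be enough; re.DOTALL matters when a single match has to cross line boundaries.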
HTH
Harvey