regular expression - matches

Fri Jul 21 19:25:11 EDT 2006

On 22/07/2006 2:18 AM, Simon Forman wrote:
> John Salerno wrote:
>> Simon Forman wrote:
>>
>>> Python's re.match() matches from the start of the string, so if you

(1) Every regex library's match() starts matching from the beginning of 
the string (unless of course there's an arg for an explicit starting 
position) -- where else would it start?

(2) This has absolutely zero relevance to the "match whole string or 
not" question.

>>> want to ensure that the whole string matches completely you'll probably
>>> want to end your re pattern with the "$" character (depending on what
>>> the rest of your pattern matches.)

*NO* ... if you want to ensure that the whole string matches completely, 
you need to end your pattern with "\Z", *not* "$". A "$" also matches 
just before a newline at the end of the string.

Perusal of the manual would seem to be indicated :-)
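
To make the difference concrete, here is a minimal sketch, written for 
Python 3 (the test string is only an illustration, and re.fullmatch() 
needs Python 3.4 or later):

```python
import re

s = "ab\n"  # a string that ends with a newline

# "$" matches at the very end, but also just before a trailing newline,
# so this succeeds even though the newline is never consumed.
dollar = re.match(r"ab$", s)

# "\Z" matches only at the very end of the string, so this fails.
slash_z = re.match(r"ab\Z", s)

# re.fullmatch() requires the pattern to cover the whole string.
full = re.fullmatch(r"ab", s)

print(dollar is not None)   # True
print(slash_z is not None)  # False
print(full is not None)     # False
```

So a pattern ending in "$" quietly accepts "ab\n", while "\Z" (or 
fullmatch) rejects it.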

>> Is that necessary? I was thinking that match() was used to match the
>> full RE and string, and if they weren't the same, they wouldn't match
>> (meaning a begin/end of string character wasn't necessary). That's wrong?

Yes. If the default were to match the whole string, then a metacharacter 
would be required to signal "*don't* match the whole string" ... 
functionality which is quite useful.
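
One place where matching less than the whole string earns its keep is a 
tokenizer. The following is only a rough sketch in Python 3: the TOKEN 
pattern and the tokenize() helper are made up for illustration. It relies 
on the optional starting-position argument of a compiled pattern's 
match() method.

```python
import re

# Hypothetical token pattern: a run of digits or a single operator character.
TOKEN = re.compile(r"\d+|[-+*/()]")

def tokenize(text):
    """Split text into tokens by repeatedly matching at the current position."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # skip whitespace between tokens
            continue
        # The match is anchored at pos but does not have to reach the end.
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError("unexpected character at position %d" % pos)
        tokens.append(m.group())
        pos = m.end()
    return tokens

print(tokenize("12 + (3*45)"))  # ['12', '+', '(', '3', '*', '45', ')']
```

If match() insisted on consuming the whole string, each call here would 
fail on anything longer than a single token.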

> 
> My understanding, from the docs and from dim memories of using
> re.match() long ago, is that it will match on less than the full input
> string if the re pattern allows it (for instance, if the pattern
> *doesn't* end in '.*' or something similar.)

Ending a pattern with '.*' or something similar is typically a mistake 
and does nothing but waste CPU cycles:

C:\junk>python -mtimeit -s"import re;s='a'+80*'z';m=re.compile('a').match" "m(s)"
1000000 loops, best of 3: 1.12 usec per loop

C:\junk>python -mtimeit -s"import re;s='a'+8000*'z';m=re.compile('a').match" "m(s)"
100000 loops, best of 3: 1.15 usec per loop

C:\junk>python -mtimeit -s"import re;s='a'+80*'z';m=re.compile('a.*').match" "m(s)"
100000 loops, best of 3: 1.39 usec per loop

C:\junk>python -mtimeit -s"import re;s='a'+8000*'z';m=re.compile('a.*').match" "m(s)"
10000 loops, best of 3: 24.2 usec per loop

The regex engine can't optimise it away because '.' means by default 
"any character except a newline", so it has to trundle all the way to 
the end just in case there's a newline lurking somewhere.

Oh and just in case you were wondering:

C:\junk>python -mtimeit -s"import re;s='a'+8000*'z';m=re.compile('a.*',re.DOTALL).match" "m(s)"
1000000 loops, best of 3: 1.18 usec per loop

In this case, logic says the '.*' will match anything, so it can stop 
immediately.
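
The behavioural difference behind those timings can be seen with a short 
sketch in Python 3 (the sample string is just an assumption chosen to put 
a newline in the middle):

```python
import re

s = "azz\nzz"  # a newline in the middle of the string

# By default "." does not match a newline, so ".*" stops just before it.
default = re.match(r"a.*", s)

# With re.DOTALL "." matches any character, so ".*" runs to the end.
dotall = re.match(r"a.*", s, re.DOTALL)

print(repr(default.group()))  # 'azz'
print(repr(dotall.group()))   # 'azz\nzz'
```

Without DOTALL the end of the match depends on where the first newline 
is, which is why the engine has to look at the characters.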

> 
> I'd test this, though, before trusting it.
> 
> What the heck, I'll do that now:
> 
> >>> import re
> >>> re.match('ab', 'abcde')
> <_sre.SRE_Match object at 0xb6ff8790>
> >>> m = _

??? What's wrong with _.group() ???

> >>> m.group()
> 'ab'
> >>> print re.match('ab$', 'abcde')
> None
> 

HTH,
John