regular expression - matches

Fri Jul 21 20:16:39 EDT 2006

On 22/07/2006 9:25 AM, John Machin wrote:

Apologies if this appears twice ... post to the newsgroup hasn't shown 
up; trying the mailing-list.

> On 22/07/2006 2:18 AM, Simon Forman wrote:
>> John Salerno wrote:
>>> Simon Forman wrote:
>>>
>>>> Python's re.match() matches from the start of the string,  so if you
> 
> (1) Every regex library's match() starts matching from the beginning of 
> the string (unless of course there's an arg for an explicit starting 
> position) -- where else would it start?
> 
> (2) This has absolutely zero relevance to the "match whole string or 
> not" question.
> 
>>>> want to ensure that the whole string matches completely you'll probably
>>>> want to end your re pattern with the "$" character (depending on what
>>>> the rest of your pattern matches.)
> 
> *NO* ... if you want to ensure that the whole string matches completely, 
> you need to end your pattern with "\Z", *not* "$".
> 
> Perusal of the manual would seem to be indicated :-)
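To make the "$" versus "\Z" distinction concrete, here is a minimal sketch in current Python 3 syntax (which postdates this thread). The names dollar, end_z and with_newline are illustrative only. It relies on the documented behaviour that "$" also matches just before a newline at the end of the string, whereas "\Z" matches only at the very end.

```python
import re

# '$' matches at the end of the string OR just before a trailing newline.
dollar = re.compile(r'ab$')
# '\Z' matches only at the very end of the string.
end_z = re.compile(r'ab\Z')

with_newline = 'ab\n'

dollar_hit = dollar.match(with_newline) is not None  # trailing newline is tolerated
z_hit = end_z.match(with_newline) is not None        # trailing newline is rejected

print(dollar_hit, z_hit)  # True False
```

So a pattern ending in "$" can still accept a string that carries one extra trailing newline, which is why "\Z" is the safer anchor for a strict whole-string check.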
> 
>>> Is that necessary? I was thinking that match() was used to match the
>>> full RE and string, and if they weren't the same, they wouldn't match
>>> (meaning a begin/end of string character wasn't necessary). That's 
>>> wrong?
> 
> Yes. If the default were to match the whole string, then a metacharacter 
> would be required to signal "*don't* match the whole string" ... 
> functionality which is quite useful.
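As a small illustration of prefix matching versus whole-string matching: later Python versions (3.4 and up) added re.fullmatch(), which was not available when this thread was written. The sketch below assumes such a version; the variable names are illustrative.

```python
import re

pattern = re.compile(r'ab')

# match() only requires the pattern to match at the start of the string.
prefix = pattern.match('abcde')
# fullmatch() (Python 3.4+) requires the pattern to consume the whole string.
whole = pattern.fullmatch('abcde')

print(prefix.group())  # ab
print(whole)           # None
```

On versions without fullmatch(), anchoring the pattern with "\Z" and calling match() is one way to get a comparable result.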
> 
>>
>> My understanding, from the docs and from dim memories of using
>> re.match() long ago, is that it will match on less than the full input
>> string if the re pattern allows it (for instance, if the pattern
>> *doesn't* end in '.*' or something similar.)
> 
> Ending a pattern with '.*' or something similar is typically a mistake 
> and does nothing but waste CPU cycles:
> 
> C:\junk>python -mtimeit -s"import re;s='a'+80*'z';m=re.compile('a').match" "m(s)"
> 1000000 loops, best of 3: 1.12 usec per loop
> 
> C:\junk>python -mtimeit -s"import re;s='a'+8000*'z';m=re.compile('a').match" "m(s)"
> 100000 loops, best of 3: 1.15 usec per loop
> 
> C:\junk>python -mtimeit -s"import re;s='a'+80*'z';m=re.compile('a.*').match" "m(s)"
> 100000 loops, best of 3: 1.39 usec per loop
> 
> C:\junk>python -mtimeit -s"import re;s='a'+8000*'z';m=re.compile('a.*').match" "m(s)"
> 10000 loops, best of 3: 24.2 usec per loop
> 
> The regex engine can't optimise it away because '.' means by default 
> "any character except a newline", so it has to trundle all the way to 
> the end just in case there's a newline lurking somewhere.
> 
> Oh and just in case you were wondering:
> 
> C:\junk>python -mtimeit -s"import re;s='a'+8000*'z';m=re.compile('a.*',re.DOTALL).match" "m(s)"
> 1000000 loops, best of 3: 1.18 usec per loop
> 
> In this case, logic says the '.*' will match anything, so it can stop 
> immediately.
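For anyone wanting to repeat the comparison from a script rather than the command line, the following is a rough sketch using the standard timeit module. The dictionary names and the loop count of 10000 are arbitrary choices for illustration, and the absolute numbers will differ by machine and Python version, so no expected timings are shown.

```python
import re
import timeit

# Same shape of test string as in the command-line runs above.
subject = 'a' + 8000 * 'z'

patterns = {
    "a": re.compile('a'),
    "a.*": re.compile('a.*'),
    "a.* (DOTALL)": re.compile('a.*', re.DOTALL),
}

calls = 10000  # arbitrary loop count, kept small so the script finishes quickly
timings = {}
for label, compiled in patterns.items():
    # timeit.timeit returns the total seconds taken for `calls` invocations.
    timings[label] = timeit.timeit(lambda c=compiled: c.match(subject), number=calls)

for label, seconds in timings.items():
    print(f"{label:14s} {seconds / calls * 1e6:.2f} usec per call")
```

Note that all three patterns succeed on this string; only the length of the matched span and the work done to find it differ.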
> 
>>
>> I'd test this, though, before trusting it.
>>
>> What the heck, I'll do that now:
>>
>> >>> import re
>> >>> re.match('ab', 'abcde')
>> <_sre.SRE_Match object at 0xb6ff8790>
>> >>> m = _
> 
> ??? What's wrong with _.group() ???
> 
>> >>> m.group()
>> 'ab'
>> >>> print re.match('ab$', 'abcde')
>> None
>>
> 
> HTH,
> John