regular expression - matches
John Machin
sjmachin at lexicon.net
Fri Jul 21 20:16:39 EDT 2006
On 22/07/2006 9:25 AM, John Machin wrote:
Apologies if this appears twice ... my post to the newsgroup hasn't shown
up, so I'm trying the mailing list.
> On 22/07/2006 2:18 AM, Simon Forman wrote:
>> John Salerno wrote:
>>> Simon Forman wrote:
>>>
>>>> Python's re.match() matches from the start of the string, so if you
>
> (1) Every regex library's match() starts matching from the beginning of
> the string (unless of course there's an arg for an explicit starting
> position) -- where else would it start?
>
> (2) This has absolutely zero relevance to the "match whole string or
> not" question.
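To make points (1) and (2) concrete, here is a small sketch, assuming a Python 3 interpreter (the session quoted in this thread is Python 2). The pattern and sample string are made up for illustration. It shows the optional start position that compiled patterns accept, and that a successful match() need not reach the end of the string.

```python
import re

pat = re.compile(r"b+")

# match() anchors at the start position, which defaults to 0.
at_start = pat.match("abbbc")       # None: the string starts with "a"

# Compiled patterns accept an optional pos argument for the start position.
from_one = pat.match("abbbc", 1)    # matches "bbb" starting at index 1

# Either way, the match may stop short of the end of the string.
print(at_start)                            # None
print(from_one.group(), from_one.span())   # bbb (1, 4)
```

Where matching starts and whether the whole string must be consumed are separate questions; the trailing "c" is simply left unmatched.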
>
>>>> want to ensure that the whole string matches completely you'll probably
>>>> want to end your re pattern with the "$" character (depending on what
>>>> the rest of your pattern matches.)
>
> *NO* ... if you want to ensure that the whole string matches completely,
> you need to end your pattern with "\Z", *not* "$".
>
> Perusal of the manual would seem to be indicated :-)
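A minimal sketch of the difference between "$" and "\Z", again assuming Python 3 and a made-up sample string. Without re.MULTILINE, "$" matches at the end of the string and also just before a newline that ends the string; "\Z" matches only at the very end.

```python
import re

s = "abcde\n"  # note the trailing newline

# "$" is satisfied just before the trailing newline, so this succeeds
# even though the pattern does not consume the whole string.
dollar = re.match(r"abcde$", s)

# "\Z" matches only at the very end of the string, so this fails.
strict = re.match(r"abcde\Z", s)

print(dollar is not None)                          # True
print(strict is None)                              # True
print(re.match(r"abcde\Z", "abcde") is not None)   # True
```

Later Python versions (3.4 onwards) also provide re.fullmatch(), which did not exist when this thread was written.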
>
>>> Is that necessary? I was thinking that match() was used to match the
>>> full RE and string, and if they weren't the same, they wouldn't match
>>> (meaning a begin/end of string character wasn't necessary). That's
>>> wrong?
>
> Yes. If the default were to match the whole string, then a metacharacter
> would be required to signal "*don't* match the whole string" ...
> functionality which is quite useful.
>
>>
>> My understanding, from the docs and from dim memories of using
>> re.match() long ago, is that it will match on less than the full input
>> string if the re pattern allows it (for instance, if the pattern
>> *doesn't* end in '.*' or something similar.)
>
> Ending a pattern with '.*' or something similar is typically a mistake
> and does nothing but waste CPU cycles:
>
> C:\junk>python -mtimeit -s"import re;s='a'+80*'z';m=re.compile('a').match" "m(s)"
> 1000000 loops, best of 3: 1.12 usec per loop
>
> C:\junk>python -mtimeit -s"import re;s='a'+8000*'z';m=re.compile('a').match" "m(s)"
> 100000 loops, best of 3: 1.15 usec per loop
>
> C:\junk>python -mtimeit -s"import re;s='a'+80*'z';m=re.compile('a.*').match" "m(s)"
> 100000 loops, best of 3: 1.39 usec per loop
>
> C:\junk>python -mtimeit -s"import re;s='a'+8000*'z';m=re.compile('a.*').match" "m(s)"
> 10000 loops, best of 3: 24.2 usec per loop
>
> The regex engine can't optimise it away because '.' means by default
> "any character except a newline", so it has to trundle all the way to
> the end just in case there's a newline lurking somewhere.
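The effect of a lurking newline can be seen directly from where the match ends. This is a sketch assuming Python 3; the test strings are invented for illustration and are shorter than the ones timed above.

```python
import re

plain = re.compile(r"a.*")

no_newline = "a" + 80 * "z"
with_newline = "a" + 40 * "z" + "\n" + 40 * "z"

# Without re.DOTALL, ".*" consumes everything up to the first newline.
print(plain.match(no_newline).end())      # 81
print(plain.match(with_newline).end())    # 41

# A bare "a" stops after one character regardless of what follows.
print(re.match(r"a", with_newline).end())  # 1
```

Because the end of the match depends on where (or whether) a newline occurs, the engine has to scan for one rather than stop after the "a".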
>
> Oh and just in case you were wondering:
>
> C:\junk>python -mtimeit -s"import re;s='a'+8000*'z';m=re.compile('a.*',re.DOTALL).match" "m(s)"
> 1000000 loops, best of 3: 1.18 usec per loop
>
> In this case, logic says the '.*' will match anything, so it can stop
> immediately.
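For anyone who wants to repeat the comparison from a script rather than the command line, here is a rough sketch using the timeit module, assuming Python 3. The string and loop count are arbitrary choices, and the timings will vary by machine and Python version, so no particular numbers (or any particular optimisation in current releases) should be assumed.

```python
import re
import timeit

s = "a" + 4000 * "z" + "\n" + 4000 * "z"

plain = re.compile(r"a.*")
dotall = re.compile(r"a.*", re.DOTALL)

# With DOTALL the match runs through the newline to the end of the string;
# without it, the match stops at the newline.
print(dotall.match(s).end() == len(s))   # True
print(plain.match(s).end())              # 4001

# Rough timing comparison; absolute numbers are machine-dependent.
for name, pat in [("plain", plain), ("dotall", dotall)]:
    secs = timeit.timeit(lambda: pat.match(s), number=10000)
    print(name, secs)
```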
>
>>
>> I'd test this, though, before trusting it.
>>
>> What the heck, I'll do that now:
>>
>> >>> import re
>> >>> re.match('ab', 'abcde')
>> <_sre.SRE_Match object at 0xb6ff8790>
>> >>> m = _
>
> ??? What's wrong with _.group() ???
>
>> >>> m.group()
>> 'ab'
>> >>> print re.match('ab$', 'abcde')
>> None
>>
>
> HTH,
> John
>
>
>