Regular expression to match a #

Duncan Booth duncan.booth at invalid.invalid
Fri Aug 12 03:42:30 EDT 2005


John Machin wrote:

> The point was made in a context where the OP appeared to be reading a 
> line at a time and parsing it, and re.compile(r'something').match() 
> would do the job; re.compile(r'^something').search() will do the job too 
> -- BECAUSE ^ means start of line anchor -- but somewhat redundantly, and 
> very inefficiently in the failing case with dopey implementations of 
> search() (which apply match() at offsets 0, 1, 2, .....).

Answering the question you think should have been asked rather than the 
question which was actually asked is a great Usenet tradition, and often 
more helpful to the poster than a straight answer would have been. However, 
you do have to be careful to make it clear that is what you are doing.

The OP did not use the word 'line' once in his post. He simply said he was 
searching a string. You didn't use the word 'line' either. If you are going 
to read more into the question than was actually asked, please try to say 
what question it is you are actually answering.

If he is using individual lines and re.match then the presence or absence 
of a leading ^ makes virtually no difference. If he is looking for all 
occurrences in a multiline string then re.search with an anchored match is a 
correct way to do it (splitting the string into lines and using re.match is 
an alternative which may or may not be appropriate).
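
To make the two approaches concrete, here is a minimal sketch. The sample text and the pattern are made up for illustration, and it assumes the anchored pattern is compiled with re.MULTILINE, since without that flag ^ only matches at the very start of the string.

```python
import re

# Hypothetical sample data; the thread's real input is not shown here.
text = "# first comment\nx = 1  # trailing\n# second comment\n"

# Approach 1: scan the whole multiline string with an anchored pattern.
# re.MULTILINE (an assumption here) lets ^ match after every newline.
anchored = re.compile(r"^#.*", re.MULTILINE)
from_search = [m.group(0) for m in anchored.finditer(text)]

# Approach 2: split into lines and use match(), which only tries offset 0,
# so no ^ is needed.
plain = re.compile(r"#.*")
from_match = [line for line in text.splitlines() if plain.match(line)]

print(from_search)
print(from_match)
```

Both lists contain only the two lines that start with #; the trailing comment on the middle line is skipped in either case.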

Either way, putting the focus on the ^ was inappropriate: the issue is 
whether to use re.search or re.match. If you assume that the search fails 
on an 80-character line, then I get timings of 6.48µs (re.search), 4.68µs 
(re.match with ^), 4.66µs (re.match without ^). A failing search on a 
10,000-character line shows how performance will degrade (225µs for search, 
no change for match), but notice that searching one 10,000-character string 
is more than twice as fast as matching 125 80-character lines.
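
For anyone who wants to repeat this kind of measurement, a rough sketch with timeit follows. The pattern, the test strings and the helper per_call_us are assumptions made for illustration, not the exact setup behind the figures above, and the numbers will vary with the machine and the Python version.

```python
import re
import timeit

# Hypothetical patterns and inputs, chosen so that every attempt fails.
with_caret = re.compile(r"^#")
without_caret = re.compile(r"#")
short_line = "x" * 80       # 80-character line containing no '#'
long_line = "x" * 10000     # 10,000-character line containing no '#'

def per_call_us(func, arg, number=2000):
    """Average time per call in microseconds (hypothetical helper)."""
    return timeit.timeit(lambda: func(arg), number=number) / number * 1e6

timings = {
    "search, 80 chars": per_call_us(with_caret.search, short_line),
    "match with ^, 80 chars": per_call_us(with_caret.match, short_line),
    "match without ^, 80 chars": per_call_us(without_caret.match, short_line),
    "search, 10000 chars": per_call_us(with_caret.search, long_line),
}

for label, us in timings.items():
    print("%s: %.2f us" % (label, us))
```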

I don't understand what you think an implementation of search() can do in 
this case apart from trying for a match at offsets 0, 1, 2, ...? It could 
find a match at any starting offset within the string, so it must scan the 
string in some form. A clever regex implementation will use Boyer-Moore 
where it can to avoid checking every index in the string, but for the 
pattern I suggested it would surprise me if any implementations actually 
manage much of an optimisation.
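
As an illustration of the offsets 0, 1, 2, ... strategy, here is a minimal sketch written in terms of match(). The function naive_search and the pattern are hypothetical, and this is not a claim about how any real regex engine implements search().

```python
import re

def naive_search(pattern, string):
    """Try match() at offsets 0, 1, 2, ... and return the first hit, or None."""
    for offset in range(len(string) + 1):
        m = pattern.match(string, offset)
        if m is not None:
            return m
    return None

pattern = re.compile(r"#\d+")        # hypothetical, unanchored pattern
line = "see ticket #42 and #7"

found = naive_search(pattern, line)
builtin = pattern.search(line)
print(found.span(), builtin.span())  # both report the same span
```

In the failing case the loop has to try every offset before giving up, which is why the cost grows with the length of the string.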


