matching a sentence, greedy up!

Tue Aug 12 12:44:04 EDT 2003

Helmut Jarausch wrote:

> Christian Buck wrote:
[...]
>> s = 'My text may i. E. look like this: This is the end.'
>> re_satz = re.compile(r'[A-Z](?:[^\.\?\!]+|'
>>          r'[^a-zA-Z0-9\-_](?:[a-zA-Z0-9\-_]\.|'
>>          r'\d+\.|a\.[\s\-]?A\.)){3,}[\.\?\!]+('
>>          r'?:(?!\s[a-z]))')

>>          Sentence:  My text may i.
>>          Sentence:  This is the end.
>> 
>> Why isn't the above regexp greedier, so that it matches the whole sentence?
>> 

> First, you don't need to escape '.', '?' or '!' within a character
> class [].

ok.
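
A quick check of that, as a minimal sketch using only the standard re
module. The variable names are just for illustration. The hyphen is the
one character here that can still matter: between two plain characters
it forms a range unless it is escaped.

```python
import re

s = 'My text may i. E. look like this: This is the end.'

# Inside [], the characters . ? ! are literal, so the backslashes are optional.
escaped = re.findall(r'[^\.\?\!]+', s)
plain = re.findall(r'[^.?!]+', s)
same_chunks = escaped == plain

# A hyphen between two plain characters is different:
# escaped it is a literal '-', unescaped it forms the range a..z.
hyphen_literal = re.findall(r'[a\-z]', 'a-bz')
hyphen_range = re.findall(r'[a-z]', 'a-bz')

print(same_chunks, hyphen_literal, hyphen_range)
```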

> The very first part r'[A-Z](?:[^\.\?\!]+' cannot be greedier since
> you exclude the '.'. So it matches up to but not including the first
> dot. Now, as far as I can see, nothing else fits.

right. so i could fix it by putting the [^.?!]+ at the end, so first it
tries to match the given abbreviations including the dot, and if that
doesn't match, it eats everything up to the end of the sentence.
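
To check that idea, here is a minimal sketch with the two alternatives
swapped, so the abbreviation branch is tried before the catch-all
[^.?!]+. The names abbrev, other, re_satz_old and re_satz_new are only
for this illustration. The unneeded backslashes inside the character
classes are dropped and the trailing (?:(?!\s[a-z])) is written as the
equivalent (?!\s[a-z]); otherwise the pattern is the one quoted above.

```python
import re

s = 'My text may i. E. look like this: This is the end.'

# Abbreviation branch: a non-word character followed by a single
# letter/digit and a dot, a number and a dot, or 'a.A.'.
abbrev = r'[^a-zA-Z0-9\-_](?:[a-zA-Z0-9\-_]\.|\d+\.|a\.[\s\-]?A\.)'
# Catch-all branch: anything that is not sentence-ending punctuation.
other = r'[^.?!]+'

head = r'[A-Z](?:'
tail = r'){3,}[.?!]+(?!\s[a-z])'

# Original order: catch-all first, abbreviations second.
re_satz_old = re.compile(head + other + '|' + abbrev + tail)
# Swapped order: abbreviations first, catch-all second.
re_satz_new = re.compile(head + abbrev + '|' + other + tail)

for name, regex in (('old', re_satz_old), ('new', re_satz_new)):
    for satz in regex.findall(s):
        print(name, 'Sentence: ', satz)
```

With this test string the old order should give the two fragments shown
above, and the swapped order should give the whole string as a single
sentence.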

I thought A|B would match the same strings as B|A does...
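
The difference shows up even in a tiny case. As a small sketch (the
variable names are made up for the example): both patterns below accept
the same set of strings when the whole string has to match, but
re.match reports the first alternative that lets the overall match
succeed, so the order decides which match you get.

```python
import re

tests = ('a', 'ab', 'b')

# Same set of strings when the whole string has to match ...
full_ab = [bool(re.fullmatch(r'a|ab', t)) for t in tests]
full_ba = [bool(re.fullmatch(r'ab|a', t)) for t in tests]

# ... but a prefix match stops at the first alternative that succeeds.
first_ab = re.match(r'a|ab', 'ab').group()   # 'a'
first_ba = re.match(r'ab|a', 'ab').group()   # 'ab'

print(full_ab == full_ba, first_ab, first_ba)
```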

> So the output is
> just what I expected. How do you think you can differentiate between
> the end of a sentence and (the first part of) an abbreviation?

I'll provide common abbreviations in the regexp, like 'a.A.' as you see 
above.

thanks!

Christian