matching a sentence, greedy up!
Christian Buck
cbuck at lantis.de
Tue Aug 12 12:44:04 EDT 2003
Helmut Jarausch wrote:
> Christian Buck wrote:
[...]
>> s = 'My text may i. E. look like this: This is the end.'
>> re_satz = re.compile(r'[A-Z](?:[^\.\?\!]+|'
>> r'[^a-zA-Z0-9\-_](?:[a-zA-Z0-9\-_]\.|'
>> r'\d+\.|a\.[\s\-]?A\.)){3,}[\.\?\!]+('
>> r'?:(?!\s[a-z]))')
>> Sentence: My text may i.
>> Sentence: This is the end.
>>
>> Why isn't the above regexp greedier, so that it matches the whole
>> sentence?
>>
> First, you don't need to escape characters such as '.', '?' or '!'
> within a character class [].
ok.
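
The escaping point can be checked with a minimal sketch. The sample string is made up for illustration; note that ']', '\', a leading '^' and a '-' between two characters still need care inside a class.

```python
import re

# Same negated class, written with and without the backslashes.
escaped = re.compile(r'[^\.\?\!]+')
plain = re.compile(r'[^.?!]+')

# Made-up sample text, not taken from the thread.
sample = 'Is it so? Yes. Good!'

# Both patterns split the text at the same terminators.
print(escaped.findall(sample))
print(plain.findall(sample))
```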
> The very first part r'[A-Z](?:[^\.\?\!]+ cannot be greedier since
> you exclude the '.'. So it matches up to but not including the first
> dot. Now, as far as I can see, nothing else fits.
right. so I could fix it by putting the [^.?!]+ at the end of the
alternation, so first it tries to match the given abbreviations
including the dot, and if that doesn't match, it eats everything up to
the sentence's end.
I thought A|B would match the same strings as B|A does...
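
The effect of alternation order can be tried out with the following sketch. It reuses the pattern from the thread with the unneeded escapes dropped; the names PLAIN, ABBREV and TAIL are only introduced here for readability. Python tries alternatives left to right and keeps the first one that lets the overall match succeed, so swapping the branches changes which match is found, at least on this one sample string.

```python
import re

s = 'My text may i. E. look like this: This is the end.'

# Building blocks of the pattern from the thread (names are illustrative).
PLAIN = r'[^.?!]+'
ABBREV = r'[^a-zA-Z0-9_-](?:[a-zA-Z0-9_-]\.|\d+\.|a\.[\s-]?A\.)'
TAIL = r'[.?!]+(?!\s[a-z])'

# Original order: plain text first, abbreviations second.
plain_first = re.compile(r'[A-Z](?:%s|%s){3,}%s' % (PLAIN, ABBREV, TAIL))
# Swapped order: abbreviations first, plain text as the fallback.
abbrev_first = re.compile(r'[A-Z](?:%s|%s){3,}%s' % (ABBREV, PLAIN, TAIL))

print(plain_first.findall(s))
# ['My text may i.', 'This is the end.']
print(abbrev_first.findall(s))
# ['My text may i. E. look like this: This is the end.']
```

This only shows the behaviour for the sample sentence; other inputs may still split at an unlisted abbreviation.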
> So the output is
> just what I expected. How do you think you can differentiate between
> the end of a sentence and (the first part of) an abbreviation?
I'll provide common abbreviations in the regexp, like 'a.A.' as you see
above.
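
One way to keep such a list maintainable is to build the alternation from plain strings. This is a simplified sketch, not the pattern from the thread: the list ABBREVIATIONS and the name sentence_re are hypothetical, only 'a.A.' is named in the thread, and the lookahead from the original pattern is left out.

```python
import re

# Hypothetical abbreviation list; extend as needed.
ABBREVIATIONS = ['a.A.', 'i. E.']

# Longest entries first, each escaped so that '.' is taken literally.
abbrev = '|'.join(
    re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True)
)

# Capital letter, then known abbreviations or any non-terminator
# character, then one or more sentence terminators.
sentence_re = re.compile(r'[A-Z](?:%s|[^.?!])*[.?!]+' % abbrev)

s = 'My text may i. E. look like this: This is the end.'
print(sentence_re.findall(s))
print(sentence_re.findall('See a.A. for details. Next one!'))
```

The abbreviations are tried before the single-character fallback, so a listed abbreviation is consumed together with its dots.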
thanks!
Christian