matching a sentence, greedy up!

Sun Aug 10 08:21:19 EDT 2003

Hi,

I'm writing a regexp that matches complete sentences in a German text 
and correctly ignores abbreviations. Here is a very simplified version of 
it; as soon as it works I could post the complete regexp if anyone is 
interested (actually 11 KB):

[A-Z](?:[^\.\?\!]+|[^a-zA-Z0-9\-_](?:[a-zA-Z0-9\-_]\.|\d+\.|a\.[\s\-]?A\.)){3,}[\.\?\!]+(?!\s[a-z])

As you can see, I use [] for character sets because I don't want to 
depend on locales, and speed doesn't matter. (I removed the German 
characters in the above example.) I also allow - and _ within a sentence.

Ok, this is what I think it should do:
[A-Z]                 - start with an uppercase char
(?:                   - open a non-capturing group
[^\.\?\!]+            - eat everything that does not look like an end
|                     - OR
[^a-zA-Z0-9\-_]       - accept a non-word character
(?:                   - followed by ...
[a-zA-Z0-9\-_]\.      - a char and a dot like 'i.', '1.' (doesn't work!!!)
|                     - OR
\d+\.                 - a number and a dot
|                     - OR
a\.[\s\-]?A\.         - some common abbreviations (only one here)
)){3,}                - repeat the outer group at least 3 times
[\.\?\!]+             - this is the end, and should also match '...'
(?!\s[a-z])           - not followed by whitespace and a lowercase char
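
The same breakdown can also be kept next to the pattern itself by compiling it with re.VERBOSE. This is only a sketch of the simplified expression above, not the full 11 KB one; the names re_satz_verbose, sample and matches are made up for the illustration.

```python
import re

sample = 'My text may i. E. look like this: This is the end.'

# Simplified sentence pattern from above, one piece per line.
# In re.VERBOSE mode, whitespace and "#" comments in the pattern are ignored.
re_satz_verbose = re.compile(r"""
    [A-Z]                       # start with an uppercase char
    (?:                         # non-capturing group
        [^.?!]+                 # anything that does not look like an end
      |                         # OR
        [^a-zA-Z0-9\-_]         # a non-word character
        (?:                     # followed by ...
            [a-zA-Z0-9\-_]\.    # one char and a dot
          | \d+\.               # a number and a dot
          | a\.[\s\-]?A\.       # a common abbreviation
        )
    ){3,}                       # repeat the group at least 3 times
    [.?!]+                      # sentence end, one or more terminators
    (?!\s[a-z])                 # not followed by whitespace + lowercase
""", re.VERBOSE)

matches = re_satz_verbose.findall(sample)
for sentence in matches:
    print("Sentence:  " + sentence)
```

Inside a character class the dot, ? and ! need no backslash, so [^.?!] here means the same as [^\.\?\!] in the one-line version.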

Here is a sample script:

- snip -
import re

s = 'My text may i. E. look like this: This is the end.'
# Simplified sentence pattern, split over several raw strings for readability.
re_satz = re.compile(r'[A-Z](?:[^\.\?\!]+|'
                     r'[^a-zA-Z0-9\-_](?:[a-zA-Z0-9\-_]\.|'
                     r'\d+\.|a\.[\s\-]?A\.)){3,}[\.\?\!]+'
                     r'(?!\s[a-z])')
mo = re_satz.search(s)
if mo:
    print("found:")
    # Use a separate loop variable so the input string s is not overwritten.
    for sentence in re_satz.findall(s):
        print("Sentence:  " + sentence)
else:
    print("not found :-(")

- snip -

Output:
    found:
    Sentence:  My text may i.
    Sentence:  This is the end.

Why isn't the above regexp greedier, so that it matches the whole sentence?
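
One possible lead, checked only against the sample string: alternatives separated by | are tried from left to right, and the first branch that lets the whole match succeed wins, so the catch-all [^.?!]+ branch may be eating ' i' before the abbreviation branch ever gets a look at ' i.'. The sketch below uses the same pieces with the abbreviation branch moved to the front; re_satz_reordered and the other names are made up for this experiment.

```python
import re

sample = 'My text may i. E. look like this: This is the end.'

# Hypothetical variant of the simplified pattern: identical pieces, but the
# abbreviation branch is tried BEFORE the catch-all [^.?!]+ branch.
re_satz_reordered = re.compile(
    r'[A-Z](?:'
    r'[^a-zA-Z0-9\-_](?:[a-zA-Z0-9\-_]\.|\d+\.|a\.[\s\-]?A\.)'
    r'|[^.?!]+'
    r'){3,}[.?!]+(?!\s[a-z])')

reordered_matches = re_satz_reordered.findall(sample)
for sentence in reordered_matches:
    print("Sentence:  " + sentence)
```

On the sample this gives a single match covering the whole string. The catch is that the single-char-and-dot branch now also swallows any one-letter word at a real sentence end, so this probably needs more thought before it goes into the full expression.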

thx in advance

Christian