[Tutor] about regular expression [breaking text into sentences]

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Wed Mar 26 22:16:01 2003


> Computers aren't clever though. Just look at automatic translators,
> such as babelfish... Understanding language is too tough for machines...


I think Abdirizak is looking for a solution to the common-case problem,
and that's something we can probably deal with.  The TeX text formatter,
for example, uses an arbitrary rule for deciding where sentences end.
And for the most part, it works!  For the exceptional cases, TeX lets us
say that a space should be treated as a word separator rather than a
sentence separator by writing the special tilde spacing character ("~")
in place of that space (so "Mr.~Baggins" tells TeX that the period after
"Mr" doesn't end a sentence).


Although I can't remember TeX's rules, there's nothing stopping us from
defining our own arbitrary rule that isn't always correct, but gets us
somewhat close.  If we can break a problem down into the common case and
the exceptional cases, we've made some progress, even if we can't get it
down perfectly.


Let's define, for the moment, a sentence break as an ending punctuation
mark ('.', '!', '?') followed by whitespace.


###
>>> import re
>>> se_break = re.compile(r"""[.!?]      ## Some ending punctuation
...                           \s+        ## followed by space.
...                        """, re.VERBOSE)
>>> se_break.split("this is a test.  Can you see this?  Hello!  Goodbye world")
['this is a test', 'Can you see this', 'Hello', 'Goodbye world']
###

This is a good start, and it can be improved on: it isn't perfect, but it
might be enough to start playing around with teaching a computer to
recognize sentence boundaries.
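
For instance, one quick refinement (just a sketch on top of the example
above): a lookbehind assertion lets us split on the whitespace but keep
each sentence's ending punctuation attached:

###
>>> import re
>>> se_break = re.compile(r"(?<=[.!?])\s+")   ## split on the space, keep the mark
>>> se_break.split("this is a test.  Can you see this?  Hello!  Goodbye world")
['this is a test.', 'Can you see this?', 'Hello!', 'Goodbye world']
###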




> Look at the following cases:
>
> a) I really liked Al. Washington is not as nice.
>
> b) I really liked Al. Washington in that movie.
>
> Case a) is two sentences, and b) is only one. Not that people typically
> use a full stop when they abbreviate Albert to Al, but that was your
> suggestion... I'm sure we can make up other cases where it's unclear
> whether an abbreviation ends a sentence unless we perform some kind of
> non-trivial grammar analysis.


But I think we can attack this problem trivially by using a non-trivial
module.  *grin* The Monty Tagger tagging engine will do syntactic markup
of a sentence with pretty good accuracy:

    http://web.media.mit.edu/~hugo/montytagger/


and if we apply it to those sentences, we can see how Monty Tagger
interprets those words:


###
> I really liked Al. Washington is not as nice.

I/PRP really/RB liked/VBD Al/NNP ./. Washington/NNP is/VBZ not/RB as/IN
nice/JJ ./.
-- monty took 0.05 seconds. --

###


Monty Tagger attaches a part-of-speech 'tag' to each word in our sentence.  And
now we can probably do something where we look at the stream of word tags,
and if a certain sequence of word tags occurs, like

    ["VBD", "NNP",  ".",  "NNP",   "VBZ"],

we can say with some certainty that there's a sentence break in there,
since English doesn't let us string two verbs together like that in a
single sentence!  Using something like Monty Tagger to do the heavy
lifting may give us enough power to make good progress.
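
Here's a rough sketch of what that kind of tag scan might look like.  (This
is purely hypothetical: the break pattern, the token format, and the
possible_breaks() helper are my own guesses based on the MontyTagger output
above, not anything from a library.)

###
## Sketch: look for the VBD NNP . NNP VBZ pattern in "word/TAG" tokens,
## and report the position of the '.' where a sentence break probably goes.

BREAK_PATTERN = ["VBD", "NNP", ".", "NNP", "VBZ"]

def possible_breaks(tagged_text):
    tokens = tagged_text.split()
    tags = [token.rsplit("/", 1)[1] for token in tokens]
    spots = []
    for i in range(len(tags) - len(BREAK_PATTERN) + 1):
        if tags[i : i + len(BREAK_PATTERN)] == BREAK_PATTERN:
            spots.append(i + 2)          ## the '.' sits two tokens into the pattern
    return spots

sample = ("I/PRP really/RB liked/VBD Al/NNP ./. "
          "Washington/NNP is/VBZ not/RB as/IN nice/JJ ./.")

print(possible_breaks(sample))           ## ==> [4], the './.' right after 'Al/NNP'
###

A real version would need a whole pile of these patterns, of course, but it
shows the flavor of the idea.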



Hmmm... but I think I'm being handwavy here.  *grin* I have not tried this
approach yet, so the problem might be hard.  I'm positive the Natural
Language Processing folks have attacked this problem with vigor, though,
and I'm not convinced this is an impossible problem.  If I have time, I'll
start reading that book I picked up from the college bookstore,

    http://www-nlp.stanford.edu/fsnlp/promo/

and see what they say about this.