NLP, Python and period

Paul Boddie paul at boddie.org.uk
Mon Aug 4 09:32:12 EDT 2008


On 4 Aug, 12:34, Fred Mangusta <a... at bbb.it> wrote:
>
> thanks for replying. I'm interested in knowing more about your regex
> approach, but as you point out in your comment, seems like access to the
> sourceforge mail archive is restricted. Is there any way I can read
> about it? Would you be so kind to cut and paste it here for instance?

I can't log into SourceForge, possibly because I've forgotten my
password, but I can give you a fairly similar regular expression which
does some of the work:

import re

sentence_pattern = re.compile(
    r'(' +
        r'[\(\"\[]*' +      # Quoting or bracketing (optional)
        r'[A-Za-z0-9]' +    # Match sentence with specific start character
        r'.+?' +            # Match sentence content - "?" means non-greedy
        r'[\.\!\?]' +       # End of sentence
        r'[\)\"\]]*' +      # End quoting or bracketing
    r')' +
    r'(\s+)' +              # Spaces
    r'[\(\"\[]*' +          # Quoting or bracketing (optional)
    r'[A-Z0-9]'             # Start of the next sentence
    )

This is mostly the same as the one posted to SourceForge, but with some
enhancements; I've indented the part which actually produces the
matched sentence text in a group. Unfortunately, some postprocessing
is required to deal with abbreviations: I maintain a list of these
and test the supposed ends of sentences that the regular expression
provides against it. In addition, I also try to detect initials (e.g.
G. van Rossum) which the regular expression may regard as the end of a
sentence.
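
For illustration, the postprocessing described above might be sketched
roughly as follows. This is not the code referred to in this message:
the abbreviation list, the helper names is_false_end and
split_sentences, and the rule used to spot initials are all
assumptions made for the example. The pattern is repeated so that the
block runs on its own.

```python
import re

sentence_pattern = re.compile(
    r'('
        r'[\(\"\[]*'      # Quoting or bracketing (optional)
        r'[A-Za-z0-9]'    # Letters and digits only (no literal comma)
        r'.+?'            # Sentence content, non-greedy
        r'[\.\!\?]'       # End of sentence
        r'[\)\"\]]*'      # End quoting or bracketing
    r')'
    r'(\s+)'              # Spaces
    r'[\(\"\[]*'          # Quoting or bracketing (optional)
    r'[A-Z0-9]'           # Start of the next sentence
)

# Hypothetical abbreviation list; a real one would be much longer.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "Dr.", "Mr.", "Mrs.", "Prof."}

def is_false_end(candidate):
    """Return True if the candidate ends in an abbreviation or an initial."""
    last_word = candidate.split()[-1].lstrip('("[')
    if last_word in ABBREVIATIONS:
        return True
    # Assumed rule: a single capital letter plus a full stop is an initial.
    return re.fullmatch(r'[A-Z]\.', last_word) is not None

def split_sentences(text):
    sentences = []
    start = 0   # Where the current sentence begins
    pos = 0     # Where the next search begins
    while True:
        match = sentence_pattern.search(text, pos)
        if match is None:
            break
        candidate = text[start:match.end(1)]
        if is_false_end(candidate):
            # Not a real boundary: keep extending the same sentence.
            pos = match.end(1)
        else:
            sentences.append(candidate)
            # Resume after the spaces, at the start of the next sentence.
            start = pos = match.end(2)
    # Whatever is left has no following sentence for the pattern to see.
    remainder = text[start:].strip()
    if remainder:
        sentences.append(remainder)
    return sentences

for sentence in split_sentences(
        "This was written by G. Van Rossum. It uses regexes, e.g. Perl "
        "style ones. Done."):
    print(sentence)
```

Note that the pattern consumes the first character of the following
sentence, which is why the loop restarts each search from the end of
the spaces group rather than from the end of the whole match.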

As I noted, I'd be interested to hear of any better solutions which
don't involve training.

Paul


