Using Groups inside Braces with Regular Expressions

John Machin sjmachin at lexicon.net
Sun Jul 13 21:13:11 EDT 2008


On Jul 14, 9:05 am, Chris <chriss... at gmail.com> wrote:

Misleading subject.

[] brackets or "square brackets"
{} braces or "curly brackets"
() parentheses or "round brackets"

> I'm trying to delimit sentences in a block of text by defining the
> end-of-sentence marker as a period followed by a space followed by an
> uppercase letter or end-of-string.

... which has at least two problems:

(1) You are insisting on at least one space between the period and the
end-of-string (this can be overcome, see later).
(2) Periods are often dropped in after abbreviations and contractions
e.g. "Mr. Geo. Smith". You will get three "sentences" out of that.
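
Both problems are easy to reproduce. The sketch below uses a simplified
separator written straight from your description (period, whitespace,
then a capital letter or end-of-string); the name "naive" and the test
strings are made up for illustration, and this is not your original
pattern.

```python
import re

# Hypothetical separator: a period, 1 or more whitespaces, then
# (without consuming it) a capital letter or end-of-string.
naive = re.compile(r'\.\s+(?=[A-Z]|$)')

# Problem (1): no whitespace after the last period, so it is never
# recognised as a separator and stays attached to the text.
no_trailing_space = naive.split('The end.')
print(no_trailing_space)

# Problem (2): every abbreviation followed by a capital letter splits.
abbreviations = naive.split('Mr. Geo. Smith')
print(abbreviations)
```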

>
> I'd imagine the regex for that would look something like:
> [^(?:[A-Z]|$)]\.\s+(?=[A-Z]|$)
>
> However, Python keeps giving me an "unbalanced parenthesis" error for
> the [^] part.

It's nice to know that Python is consistent with its error messages.

> If this isn't valid regex syntax,

If? It definitely isn't valid syntax. Brackets delimit a character
class, and the class ends at the first "]". So Python reads
[^(?:[A-Z] as the class, and the ")" a few characters later has no
matching "(", hence the error. Either you are trying to cram a somewhat
complicated expression into a character class, or you should be using
parentheses. Either way, it's hard to tell what you meant that part of
the pattern to achieve.
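
A quick way to check how the re module reads that pattern, as a sketch
only; the variable names are illustrative, and the exact wording of the
error message may differ between Python versions.

```python
import re

# The pattern from the question, copied verbatim.
pattern = r'[^(?:[A-Z]|$)]\.\s+(?=[A-Z]|$)'

try:
    re.compile(pattern)
    compile_error = None
except re.error as exc:
    # Compilation fails; keep the message so it can be inspected.
    compile_error = str(exc)
print(compile_error)

# Inside brackets, '(', '?' and ':' are ordinary characters to match,
# not group syntax.
literal_hits = re.findall(r'[(?:]', 'a(b?c:d')
print(literal_hits)
```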

> how else would I match
> a block of text that doesn't match the delimiter pattern?

Start from the top down:
A sentence is:
   anything (with some qualifications)
followed by (but not including):
   a period
followed by
   either
      1 or more whitespaces then a capital letter
   or
      0 or more whitespaces then end-of-string

So something like this might do the trick:

>>> import re
>>> sep = re.compile(r'\.(?:\s+(?=[A-Z])|\s*(?=\Z))')
>>> sep.split('Hello. Mr. Chris X\nis here.\nIP addr 1.2.3.4.  ')
['Hello', 'Mr', 'Chris X\nis here', 'IP addr 1.2.3.4', '']
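
The same separator can be written with re.VERBOSE so that each piece
lines up with the top-down description. This is only a sketch: the
names SEP_VERBOSE and split_sentences are invented here, and dropping
the empty trailing string is an extra step you may or may not want.

```python
import re

SEP_VERBOSE = re.compile(r"""
    \.                  # a period
    (?:
        \s+ (?=[A-Z])   # 1 or more whitespaces, then a capital letter
      |
        \s* (?=\Z)      # 0 or more whitespaces, then end-of-string
    )
""", re.VERBOSE)

def split_sentences(text):
    # split() leaves an empty string after a final separator; drop it.
    return [s for s in SEP_VERBOSE.split(text) if s]

sentences = split_sentences('Hello. Mr. Chris X\nis here.\nIP addr 1.2.3.4.  ')
print(sentences)
```

Note that "Mr" still comes out as its own "sentence"; problem (2) is
not something this pattern tries to solve.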


