[Edu-sig] intro to regular expressions

kirby urner kirby.urner at gmail.com
Thu Apr 15 20:29:14 CEST 2010


Here's a module people could easily expand upon, staying with
Jabberwocky as the target text.

I'm by no means a master of the regexp.  For example, I wanted to pick
out all sentences with Jabberwock including those beginning and ending
with quote marks (if present), i.e. keeping the quotes in the match.
My current attempt loses the quote marks, keeping the enclosed sentence.
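
One possible way to keep the quotes, offered as a sketch rather than a
definitive answer: make a double quote optional at each end of the
pattern with "?.  The names excerpt and quoted_sentences are just
illustrative, and the excerpt is three stanzas of the poem rather than
the whole text.

```python
import re

# Excerpt: the three stanzas of the poem that mention the Jabberwock.
excerpt = '''"Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!"

And as in uffish thought he stood,
The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
And burbled as it came!

"And hast thou slain the Jabberwock?
Come to my arms, my beamish boy!
O frabjous day! Callooh! Callay!"
'''

# Same pattern as the sentence test, plus an optional quote ("?) at each end.
regexp = r'"?[A-Z]\w+\b[^.!?"]+Jabberwock.*?[?!.]"?'
quoted_sentences = re.findall(regexp, excerpt, re.DOTALL)

for sentence in quoted_sentences:
    print(sentence)
    print("----")
```

The trailing "? only picks up a quote that comes immediately after the
terminating punctuation.  In this poem the quoted speeches run on for
several sentences, so the Jabberwock sentences keep their opening quote
and nothing more.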

One could imagine 20-30 more regexps, if not hundreds, populating this
file.  A doctest version could display the expected output, though it
gets kinda verbose (see the output appended below), so maybe selected
examples only.
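
As a rough idea of what a doctest version might look like, here is a
minimal sketch.  The helper lines_with is hypothetical (it is not part
of the module below), and the sample text is cut down to two lines to
keep the expected output short.

```python
import doctest
import re

def lines_with(word, text):
    r"""
    Return every line of text that contains word.

    >>> sample = "Beware the Jabberwock, my son!\nHe chortled in his joy."
    >>> lines_with("Jabberwock", sample)
    ['Beware the Jabberwock, my son!']
    >>> lines_with("Bandersnatch", sample)
    []
    """
    # re.escape keeps any special characters in word from acting as regexp syntax.
    regexp = r"^.*" + re.escape(word) + r".*$"
    return re.findall(regexp, text, re.MULTILINE)

if __name__ == "__main__":
    # Runs the examples in the docstring; silent when they all pass.
    doctest.testmod()
```

The docstring is a raw string so that \n reaches doctest as written
instead of turning into a real line break inside the example.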

Kirby

===

"""
Playing with regular expressions...
GPL 2010 4D Solutions

This small suite of tests could easily be
augmented with more elaborate ones, or
simply variations on the theme.  Consider
this a workbench for testing out your regexps.
"""

import re

poem = """
'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.

"Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!"

He took his vorpal sword in hand:
Long time the manxome foe he sought-
So rested he by the Tumtum tree,
And stood awhile in thought.

And as in uffish thought he stood,
The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
And burbled as it came!

One, two! One, two! and through and through
The vorpal blade went snicker-snack!
He left it dead, and with its head
He went galumphing back.

"And hast thou slain the Jabberwock?
Come to my arms, my beamish boy!
O frabjous day! Callooh! Callay!"
He chortled in his joy.

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.
"""

def show_all(title, regexp, match_list):
    # Single-argument print(...) calls run under both Python 2 and Python 3.
    print("%s\n%s" % (title, len(title) * "="))
    print("regexp: %s\n" % regexp)
    if match_list:
        for list_item in match_list:
            print("%s\n----" % (list_item,))
    else:
        print("No Matches")
    print("\n\n")

def test0():
    """
    Show each line of text in which Jabberwock appears,
    with ^ matching after every newline, not just at the
    start of the string (the purpose of the MULTILINE flag).
    Given the word Jabberwock is not in the first line,
    there is no match without MULTILINE.
    """
    regexp = r"^.*Jabberwock.*$"
    p = re.compile(regexp, re.MULTILINE)
    m = p.findall(poem)
    show_all("Lines with Jabberwock", regexp, m)

def test1():
    """
    Sentences in which Jabberwock appears, starting
    with a capital letter and ending with punctuation.
    The non-greedy .*? matches across newlines because
    of DOTALL.  After the first capitalized word, the
    matcher goes through any character that's not
    a terminating punctuation mark or a double quote,
    through the string Jabberwock, and on to the terminus.
    """
    regexp = r'[A-Z]\w+\b[^.!?"]+Jabberwock.*?[?!.]'
    # how to include outside quotes if present?
    p = re.compile(regexp, re.MULTILINE | re.DOTALL)
    m = p.findall(poem)
    show_all("Sentences with Jabberwock", regexp, m)

def test2():
    """
    Find all strings enclosed in quotes (") that also
    end with an exclamation point.  The *? makes *
    behave in a "non-greedy" manner, so each match stops
    at the first !" that follows the opening quote.
    """
    regexp = r'".*?!"'
    p = re.compile(regexp, re.MULTILINE | re.DOTALL)
    m = p.findall(poem)
    show_all("Exclamations", regexp, m)

def test3():
    """
    Here we're looking for words starting with capital
    letters, then we're grabbing up to 3 characters on
    either side, including newlines if need be. The DOTALL
    is what picks up newlines.
    """
    regexp = r'.{0,3}[A-Z]\w+\b.{0,3}'
    p = re.compile(regexp, re.MULTILINE | re.DOTALL)
    m = p.findall(poem)
    show_all("Capitals", regexp, m)

def alltests():
    test0()
    test1()
    test2()
    test3()

if __name__ == "__main__":
    alltests()

===

Lines with Jabberwock
=====================
regexp: ^.*Jabberwock.*$

"Beware the Jabberwock, my son!
----
The Jabberwock, with eyes of flame,
----
"And hast thou slain the Jabberwock?
----



Sentences with Jabberwock
=========================
regexp: [A-Z]\w+\b[^.!?"]+Jabberwock.*?[?!.]

Beware the Jabberwock, my son!
----
And as in uffish thought he stood,
The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
And burbled as it came!
----
And hast thou slain the Jabberwock?
----



Exclamations
============
regexp: ".*?!"

"Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!"
----
"And hast thou slain the Jabberwock?
Come to my arms, my beamish boy!
O frabjous day! Callooh! Callay!"
----



Capitals
========
regexp: .{0,3}[A-Z]\w+\b.{0,3}


'Twas br
----
es
Did gy
----
e;
All mi
----
s,
And th
----


"Beware th
----
e Jabberwock, m
----
n!
The ja
----
h!
Beware th
----
e Jubjub bi
----
un
The fr
----
us Bandersnatch!"

----

He to
----
d:
Long ti
----
t-
So re
----
he Tumtum tr
----
e,
And st
----
.

And as
----
d,
The Ja
----
e,
Came wh
----
d,
And bu
----
!

One, t
----
o! One, t
----
gh
The vo
----
k!
He le
----
ad
He we
----


"And ha
----
he Jabberwock?
C
----
y! Callooh! C
----
!"
He ch
----


'Twas br
----
es
Did gy
----
e;
All mi
----
s,
And th
----

