[Tutor] Finding quoted strings in a text? A hands-on tutorial on DFA->regex conversion

Tue Dec 16 18:28:06 EST 2003

[Warning: there's heavy regular expression and finite state automata stuff
in this message.  Skip if it gets really confusing.]

Hi everyone,

For a small project that I'm working on, I've calculated out a regular
expression for finding quoted strings, but it's just absolutely hideous.
By quoted string, I mean a string surrounded with quotes, also allowing
for escaped quote charactesr.  For example:

   r'this is a "\"test\""'

is an example where I want to capture r"\"test\"" as a whole.

I'd like to get some input from the group to see the "best" way of doing
this, with regular expressions.  I need a regular expression solution
because I'm trying to combine this with other existing regular
expressions.

Here's my regular expression for detecting quoted strings:

###
regex = re.compile(r'(?:"[^"]*\\(?:.[^"]*\\)*.[^"]*")|(?:"[^"]*")')
###

This is one of the ugliest regexes I've written.  *grin* I wanted to avoid
using lookahead since older versions of Python have potential recursion
problems with it and long strings.  This solution only depends on
character classes and the '*' repetition operator.

The monster actually works:

###
>>> regex.findall('this is a piece of "quoted" text.')
['"quoted"']
>>> regex.findall('"hello" world, this is "another test".')
['"hello"', '"another test"']
>>> regex.findall(r'"this" even allows for "\"escaped\"" quotes, "yes?"')
['"this"', '"\\"escaped\\""', '"yes?"']
###

On the other hand, it's ugly, and probably nonoptimal.  Let me show how I
constructed the beast.

Constructing the regular expression by hand wasn't actually too bad.  In a
previous discussion, we mentioned that regular expressions and finite
state machines are equivalent to each other, and there's an easy finite
state machine for detecting quoted strings.  Here's a picture of it:

                       n
                       -
                      / /
               q      v /       q
      START--------> STATE1 -------->FINAL

                     |  ^
                    b|  |
                     |  | a
                     V  |
                    STATE2

In the picture,

     q     stands for the quote symbol        "
     n     stands for any non-quote character [^"]
     b     stands for the backslash symbol    \
     a     stands for any character           .

This finite state machine has four states, but it's not a regular
expression yet.  One systematic way of transforming it into a regular
expression is to iteratively "rip"  out states until we're left with only
START and FINAL.

As we "rip" states out, we need to repair the damage to the machine.  For
example, if we had something like:

                             c
                            -
                           / /
                    a     v /      b
           STATE1 -----> STATE2 ------> STATE3
             |                            ^
             |                            |
              \--------------------------/
                            d

then we can rip out STATE2, and our resulting machine will look like:

                         (ac*b) | d
                 STATE1 ------------> STATE3

Rip repair focuses on modifying edges between pairs of states so that
paths that go across the ripped state take account of any missing edges.
In the example above, the self-loop at STATE2 gets absorbed into the edge
between STATE1 and STATE3, as part of the 'c*' section of the regular
expression.

So here's what things look like as we start ripping states out.  Given our
original machine:

                       n
                       -
                      / /
               q      v /       q
      START--------> STATE1 -------->FINAL

                     |  ^
                    b|  |
                     |  | a
                     V  |
                    STATE2

we can rip out STATE1, repair the eges, and get a machine with three
states:

                      an*b
                       -
                      / /
             qn*b     v /     an*q
      START--------> STATE2 -------->FINAL
        |                              ^
        |                              |
         \----------------------------/
                     qn*q

This still does the same work that the original machine does.  Now we can
rip out STATE2, and patch up the edges again:

           (qn*b(an*b)*an*q) | (qn*q)
      START------------------------->FINAL

Finally, I substituted back 'a', 'n', 'b', and 'q' with the real symbols
that they stood for, and got back that huge ugly regular expression:

    r'("[^"]*\\(.[^"]*\\)*.[^"]*")|("[^"]*")'

The only difference between this and the regex from way above:

    regex = re.compile(r'(?:"[^"]*\\(?:.[^"]*\\)*.[^"]*")|(?:"[^"]*")')

is the addition of '?:' so that the groups that are defined aren't
"capturing" things.

So the regular expression is ugly, but the process of calculating it is
actually pretty systematic.  Anyway, I hope this was interesting!