RegEx issues

Sat Jan 24 19:04:23 EST 2009

On Jan 25, 5:59 am, Scott David Daniels <Scott.Dani... at Acm.Org> wrote:
> Sean Brown wrote:
> > I have the following string ...:  "td[ct] = [[ ... ]];\r\n"
> > The ... (representing text in the string) is what I'm extracting ....
> > So I think the regex \[\[(.*)\]\]; should do it.
> > The problem is it appears that python is escaping the \ in the regex
> > because I see this:
> > >>> reg = '\[\[(.*)\]\];'
> > >>> reg
> > '\\[\\[(.*)\\]\\];'
> > Now to me looks like it would match the string - \[\[ ... \]\];
> > ...
>
> OK, you already have a good answer as to what is happening.
> I'll mention that raw strings were put in the language exactly for
> regex work.  They are useful for any time you need to use the backslash
> character (\) within a string (but not as the final character).
> For example:
>      len(r'\a\b\c\d\e\f\g\h') == 16 and len('\a\b\c\d\e\f\g\h') == 13
>
> If you get in the habit of typing regex strings as r'...' or r"...",
> and examining the patterns with print(somestring), you'll ease your life.
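
A minimal sketch of that advice, applied to the pattern from the
original question. The names sample and m are made up for illustration,
and the "..." in the sample text is the elision from the post, taken
literally:

```python
import re

# Raw string: every backslash reaches the regex engine unchanged.
reg = r'\[\[(.*)\]\];'

# repr() doubles each backslash for display only; the string itself
# holds a single backslash per escape.
print(repr(reg))   # '\\[\\[(.*)\\]\\];'
print(reg)         # \[\[(.*)\]\];
print(len(reg))    # 13

# Sample text modelled on the string in the question.
sample = "td[ct] = [[ ... ]];\r\n"
m = re.search(reg, sample)
print(repr(m.group(1)))   # ' ... '
```

So the doubled backslashes at the interactive prompt are only how the
string is displayed; print() shows what the regex engine actually gets.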

All excellent suggestions, but I'm surprised that nobody has mentioned
the re.VERBOSE format.

Manual sez:
'''
re.X
re.VERBOSE
This flag allows you to write regular expressions that look nicer.
Whitespace within the pattern is ignored, except when in a character
class or preceded by an unescaped backslash, and, when a line contains
a '#' neither in a character class or preceded by an unescaped
backslash, all characters from the leftmost such '#' through the end
of the line are ignored.

That means that the two following regular expression objects that
match a decimal number are functionally equal:

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")
'''
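
A quick way to convince yourself of the manual's claim is to run both
patterns over a few inputs. This is only a spot check, not a proof; the
samples list is my own choice:

```python
import re

# The two patterns from the manual excerpt.
a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")

# Hypothetical test inputs, chosen to include matches and non-matches.
samples = ["3.14", "42.", "x 0.5 y", "no digits", ".5"]

results = []
for s in samples:
    ma, mb = a.search(s), b.search(s)
    ga = ma.group(0) if ma else None
    gb = mb.group(0) if mb else None
    assert ga == gb          # both patterns agree on every sample
    results.append(ga)
    print(repr(s), "->", repr(ga))
```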

My comments:
(1) "looks nicer" is not the point; it's understandability
(2) if you need a space, use a character class ->[ ]<- rather than a
space preceded by an unescaped backslash ->\ <-
(3) the indentation in the manual doesn't fit my idea of "looks
nicer"; I'd do
a = re.compile(r"""
    \d +  # the integral part
    \.    # the decimal point
    \d *  # some fractional digits
    """, re.X)
(4) you can aid understandability with more indentation, especially when
you have multiple capturing groups and (?......) gizmoids, e.g.
r"""
    (
         ..... # prefix
    )
    (
         (?......) # look-behind assertion
         (?....) # etc etc
    )
"""
Worth a try if you find yourself going nuts getting the parentheses
matching.
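
Here is a small runnable sketch of points (2) and (4) together. The
task (pulling a currency code and an amount out of text) and the names
pattern, naive, text and m are invented purely for illustration:

```python
import re

pattern = re.compile(r"""
    (?<![A-Z])          # look-behind: not in the middle of a longer word
    (                   # group 1: three-letter currency code
        [A-Z]{3}
    )
    [ ]                 # a literal space, written as a character class
    (                   # group 2: the amount
        \d+             # integral part
        (?:             # optional fractional part, non-capturing
            \.
            \d{2}
        )?
    )
    """, re.VERBOSE)

text = "Total: USD 12.50 due"
m = pattern.search(text)
print(m.groups())       # ('USD', '12.50')

# A bare space is dropped in verbose mode, so this pattern is really
# [A-Z]{3}\d+ and does not match text that contains the space.
naive = re.compile(r"[A-Z]{3} \d+", re.VERBOSE)
print(naive.search("USD 12"))   # None
```

With each group on its own indentation level, a missing or extra
parenthesis tends to show up as a line that doesn't line up.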

Cheers,
John