Regular expressions

Thu Nov 5 09:00:00 EST 2015

On 2015-11-05 23:05, Steven D'Aprano wrote:
> Oh the shame, I knew that. Somehow I tangled myself in a knot,
> thinking that it had to be 1 *followed by* zero or more characters.
> But of course it's not a glob, it's a regex.

But that's a good reminder of the fnmatch/glob modules too.  Sometimes
all you need is to express a simple glob, in which case using a
regexp can cloud the clarity.

The overarching principle is to go for clarity & simplicity, rather
than reaching for the same tool (built-ins, glob, regex, or a parser
module) every time.

Want to test for presence in a string?  Just use the builtin "a in b"
test.  At the beginning/end?  Use .startswith()/.endswith() for
clarity.  Need to check if a string is purely
digits/alpha/alphanumerics/etc?  Use the
.is{alnum,alpha,decimal,digit,identifier,lower,numeric,printable,space,title,upper}
methods on the string.
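
A rough sketch of those built-in tests, using a made-up filename
(the sample value and variable names are just for illustration):

```python
# Hypothetical sample value, not from any real data set
filename = "2015-11-05_ABC123.txt"

has_underscore = "_" in filename                 # substring presence
starts_with_year = filename.startswith("2015")   # test at the beginning
is_text_file = filename.endswith(".txt")         # test at the end
year_is_digits = filename[:4].isdigit()          # purely digits?
whole_is_alnum = filename.isalnum()              # False: contains "-", "_" and "."
```

No imports, no pattern syntax to remember, and each line says what
it is testing.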

For simple wild-carding, use the fnmatch module to do simple
globbing.
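
Something along these lines, as a minimal sketch (the list of names
is invented for the example):

```python
import fnmatch

# Hypothetical file names, only here to have something to match against
names = ["2015-11-05_ABC123.txt", "notes.md", "2015-12-01_XY9.txt", "README"]

# "*" matches any run of characters, "?" matches exactly one character
txt_files = fnmatch.filter(names, "*.txt")

# fnmatchcase() compares case-sensitively regardless of platform
november = [n for n in names if fnmatch.fnmatchcase(n, "2015-11-??_*.txt")]
```

Here txt_files picks up the two .txt names and november only the
first one, with patterns that read the same as they would in a shell.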

For more complex pattern matching, you've got regexps.

Finally, for occasions when you're searching for repeated/nested
structures, using an add-on module like pyparsing will give you
clearer code.

Oh, and with regexps, people should be less afraid of verbose
multi-line strings with comments:

  import re

  r = re.compile(r"""
    ^                       # start of the string
    (?P<year>\d{4})         # capture 4 digits
    -                       # a literal dash
    (?P<month>\d{1,2})      # capture 1-2 digits
    -                       # another literal dash
    (?P<day>\d{1,2})        # capture 1-2 digits
    _                       # a literal underscore
    (?P<accountnum>         # capture the account-number
      [A-Z]{1,3}            # 1-3 uppercase letters
      \d+                   # followed by 1+ digits
      )
    \.txt                   # the extension of the file (not captured)
    $                       # the end of the string
    """, re.VERBOSE)

They are a LOT easier to come back to if you haven't touched the code
for a year.
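
The named groups pay off at the call site too.  A small sketch of
using such a pattern (the names filename_re and fields and the sample
filenames are my own, purely illustrative):

```python
import re

# Same shape of pattern as above, repeated so this snippet runs on its own
filename_re = re.compile(r"""
    ^
    (?P<year>\d{4})         # 4-digit year
    -
    (?P<month>\d{1,2})      # 1-2 digit month
    -
    (?P<day>\d{1,2})        # 1-2 digit day
    _
    (?P<accountnum>         # account number:
      [A-Z]{1,3}            #   1-3 uppercase letters
      \d+                   #   followed by 1+ digits
      )
    \.txt                   # the extension (not captured)
    $
    """, re.VERBOSE)

m = filename_re.match("2015-11-05_ABC123.txt")
fields = m.groupdict() if m else {}      # group name -> matched text

# Lowercase letters don't match [A-Z], so this one returns None
no_match = filename_re.match("2015-11-05_abc123.txt")
```

fields["accountnum"] reads a lot better than m.group(4), and it
keeps working if somebody adds another group earlier in the pattern.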

-tkc