common regular expressions

Andrew Dalke dalke at dalkescientific.com
Wed Dec 12 18:00:44 EST 2001


Hey all,

  I'm trying to put together a set of common regular expression
patterns for Martel (http://www.dalkescientific.com/Martel/ ).
The idea is to provide a set of "macros," so that people don't
have to create them from scratch.


The ones I've thought of so far are:

  Digits = \d+
  Integer = [+-]?\d+
  Float = something like (wrote this on the fly; likely wrong)
      [+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][+-]?[0-9]+)?
        I know this excludes NaN and [+-]Inf
  Word = \w+
  Spaces = [ \t]+  (NOT the same as \s+)
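
As a rough check of the numeric patterns above, here is a minimal Python
sketch using the standard `re` module. The helper name `matches` is made up
for illustration, and the Float pattern here uses `[0-9]+` for the integer
part (an assumption, so that multi-digit values such as 12.5 match in full).

```python
import re

# Patterns from the list above. FLOAT uses [0-9]+ for the integer part,
# which is an assumption made so that "12.5" matches as a whole.
DIGITS = r"\d+"
INTEGER = r"[+-]?\d+"
FLOAT = r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][+-]?[0-9]+)?"
SPACES = r"[ \t]+"


def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


for text in ["42", "-7", "12.5", ".5", "1e10", "-3.2E-4", "NaN", "1.2.3"]:
    print(f"{text!r:10} Integer={matches(INTEGER, text)} "
          f"Float={matches(FLOAT, text)}")

# Spaces is narrower than \s+: it rejects a newline, which \s+ accepts.
print(matches(SPACES, "\n"), matches(r"\s+", "\n"))
```

Using `re.fullmatch` rather than `re.match` matters here: `re.match` would
happily accept the leading "1.2" of "1.2.3" and report success.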

  # Before you read further, yes, I know some of these
  # patterns are too small or too large.  Yes, the IPv4
  # should only take a number up to 255 and there are
  # different forms besides the dotted quad.  Yes, the
  # URN could have a "." in the last position.  Yes,
  # URLs are very complicated.  Yes, I've read Friedl's
  # book.
  #
  # All I'm trying to do is make it so people who don't
  # know regular expressions very well (and haven't read
  # Friedl's book) can get nearly all of what they want
  # with little work.

  _basic_host = [a-zA-Z]([a-zA-Z0-9]|-[a-zA-Z0-9])*
  Hostname = {_basic_host}(\.{_basic_host})+ |
             [0-9]+(\.[0-9]+){3}

  Email = {Word}@{Hostname}
  RFC_From = ... a big mess
  Mailto = mailto:{Email}
  URN = [Uu][Rr][Nn]:([^\s.]+|\.(?!\s))+
  URL = ... another big mess
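
To make the `{Name}` notation concrete, here is a small Python sketch of how
such macros might be expanded into plain regular expressions. This is not how
Martel does it; `MACROS`, `expand`, and `_REF` are hypothetical names, the
Hostname alternation is written on one line, and the dotted-quad branch
assumes `[0-9]+` in every field.

```python
import re

# Definitions from the list above; {Name} refers to another definition.
# Assumption: the dotted-quad branch allows multi-digit fields.
MACROS = {
    "Word": r"\w+",
    "_basic_host": r"[a-zA-Z]([a-zA-Z0-9]|-[a-zA-Z0-9])*",
    "Hostname": r"{_basic_host}(\.{_basic_host})+|[0-9]+(\.[0-9]+){3}",
    "Email": r"{Word}@{Hostname}",
    "Mailto": r"mailto:{Email}",
}

# Only {identifier} counts as a macro reference, so repeat counts
# such as {3} or {2,4} are left untouched.
_REF = re.compile(r"\{([A-Za-z_]\w*)\}")


def expand(name, macros=MACROS, _seen=()):
    """Recursively expand {Name} references in macros[name]."""
    if name in _seen:
        raise ValueError(f"circular macro reference: {name}")

    def substitute(match):
        # Wrap each expansion in a non-capturing group so that an
        # alternation inside it cannot leak into the surrounding pattern.
        inner = expand(match.group(1), macros, _seen + (name,))
        return "(?:" + inner + ")"

    return _REF.sub(substitute, macros[name])


email = re.compile(expand("Email"))
for address in ["dalke@dalkescientific.com", "first.last@example.com"]:
    print(address, bool(email.fullmatch(address)))
```

The second address is rejected because `\w+` does not allow a "." in the
local part, which is one of the "too small" cases mentioned above.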

  US_phone = (\(\d+\) ?|\d\d\d[ .-])?\d\d\d[ .-]\d\d\d\d
        (sigh, this excludes 1-800-555-1212)
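
A quick Python sketch of what US_phone accepts and rejects. The
`US_PHONE_WITH_1` variant, with an optional leading "1" and separator, is a
guess at one possible extension, not part of the list above.

```python
import re

# The pattern exactly as listed above.
US_PHONE = re.compile(r"(\(\d+\) ?|\d\d\d[ .-])?\d\d\d[ .-]\d\d\d\d")

# Hypothetical variant: allow an optional long-distance "1" prefix.
US_PHONE_WITH_1 = re.compile(
    r"(1[ .-])?(\(\d+\) ?|\d\d\d[ .-])?\d\d\d[ .-]\d\d\d\d"
)

samples = ["555-1212", "(505) 555-1212", "505.555.1212", "1-800-555-1212"]
for number in samples:
    print(f"{number:16} original={bool(US_PHONE.fullmatch(number))} "
          f"with_1={bool(US_PHONE_WITH_1.fullmatch(number))}")
```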

  # various dates -- would be nice to have more formal
  # names for these rather than making up my own
  month = 0[1-9]|1[012]?|[23456789]
  day = 0[1-9]|1[0-9]?|2[0-9]?|3[01]?|[456789]
  year_2 = \d\d
  year_4 = \d\d\d\d
  year_2_or_4 = \d\d(\d\d)?
  US_date = {month}/{day}/{year_2_or_4}
  International_date = {day}[/.]{month}[/.]{year_2_or_4}
  ISO8601_date = \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}
  ARPA_date = ...
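
A short Python sketch of the date patterns, mostly to show that they check
the shape of a date and not whether it exists on the calendar. The constant
names are made up, the sub-patterns are wrapped in non-capturing groups, and
US_DATE assumes a "/" between day and year, as in 12/25/2001.

```python
import re

# month, day and year_2_or_4 from the list above, each wrapped in a
# non-capturing group so the alternations stay self-contained.
MONTH = r"(?:0[1-9]|1[012]?|[23456789])"
DAY = r"(?:0[1-9]|1[0-9]?|2[0-9]?|3[01]?|[456789])"
YEAR_2_OR_4 = r"\d\d(?:\d\d)?"

# Assumption: a "/" separates the day from the year.
US_DATE = re.compile(f"{MONTH}/{DAY}/{YEAR_2_OR_4}")
INTERNATIONAL_DATE = re.compile(f"{DAY}[/.]{MONTH}[/.]{YEAR_2_OR_4}")

for text in ["12/25/2001", "1/5/01", "13/01/2001", "02/31/2001"]:
    print(f"{text:12} US_date={bool(US_DATE.fullmatch(text))}")

for text in ["25.12.2001", "25/12/01", "32/1/2001"]:
    print(f"{text:12} International_date="
          f"{bool(INTERNATIONAL_DATE.fullmatch(text))}")
```

Note that "02/31/2001" is accepted: the patterns limit the month to 1-12 and
the day to 1-31 but know nothing about month lengths.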


As you can see, these get hairy, fast.

Does anyone have suggestions for common, useful patterns?
Both needed patterns and existing definitions would help.

I searched the web for definitions but found nothing better
than what I listed here.

                    Andrew
                    dalke at dalkescientific.com






