common regular expressions
Andrew Dalke
dalke at dalkescientific.com
Wed Dec 12 18:00:44 EST 2001
Hey all,
I'm trying to put together a set of common regular expression
patterns for Martel (http://www.dalkescientific.com/Martel/ ).
The idea is to provide a set of `macros,' rather than having
people create them from scratch.
The ones I've thought of so far are:
Digits = \d+
Integer = [+-]?\d+
Float = something like (wrote this on the fly; likely wrong)
[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][+-]?[0-9]+)?
I know this excludes NaN and [+-]Inf
Word = \w+
Spaces = [ \t]+ (NOT the same as \s+)
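A quick way to sanity-check the basic patterns is to run them through
Python's re module. This is only a sketch: the constant names and the
matches() helper are made up for illustration, and the Float pattern here
allows several digits before the decimal point, which is an assumption
about the intent.

```python
import re

# Basic patterns from the list above. FLOAT uses [0-9]+ for the integer
# part (an assumption, so that "12.5" is accepted as well as "1.5").
DIGITS = r"\d+"
INTEGER = r"[+-]?\d+"
FLOAT = r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][+-]?[0-9]+)?"
WORD = r"\w+"
SPACES = r"[ \t]+"


def matches(pattern, text):
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None
```

With these, matches(FLOAT, "-12.5e-3") and matches(FLOAT, ".5") are true,
matches(FLOAT, "NaN") is false, and matches(SPACES, "\n") is false even
though \s+ would accept a newline.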
# Before you read further, yes, I know some of these
# patterns are too small or too large. Yes, the IPv4
# should only take a number up to 255 and there are
# different forms besides the dotted quad. Yes, the
# URN could have a "." in the last position. Yes,
# URLs are very complicated. Yes, I've read Friedl's
# book.
#
# All I'm trying to do is make it so people who don't
# know regular expressions very well (and haven't read
# Friedl's book) can get nearly all of what they want
# with little work.
_basic_host = [a-zA-Z]([a-zA-Z0-9]|-[a-zA-Z0-9])*
Hostname = {_basic_host}(\.{_basic_host})+ |
[0-9]+(\.[0-9]+){3}
Email = {Word}@{Hostname}
RFC_From = ... a big mess
Mailto = mailto:{Email}
URN = [Uu][Rr][Nn]:([^\s.]+|\.(?!\s))+
URL = ... another big mess
US_phone = (\(\d+\) ?|\d\d\d[ .-])?\d\d\d[ .-]\d\d\d\d
(sigh, this excludes 1-800-555-1212)
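The {Name} references in the definitions above could be expanded with a
small recursive substitution. The sketch below is not how Martel does it;
DEFINITIONS, _REFERENCE and expand() are hypothetical names, and the
dotted-quad branch of Hostname is written so that each part may have more
than one digit.

```python
import re

# Hypothetical table of named patterns, following the list above.
DEFINITIONS = {
    "Word": r"\w+",
    "_basic_host": r"[a-zA-Z]([a-zA-Z0-9]|-[a-zA-Z0-9])*",
    "Hostname": r"{_basic_host}(\.{_basic_host})+|[0-9]+(\.[0-9]+){3}",
    "Email": r"{Word}@{Hostname}",
    "Mailto": r"mailto:{Email}",
}

# A reference is {identifier}. Requiring a leading letter or underscore
# keeps quantifiers such as {3} from being mistaken for references.
_REFERENCE = re.compile(r"\{([A-Za-z_]\w*)\}")


def expand(name, definitions=DEFINITIONS):
    """Return the pattern for name with every {Other} reference expanded.

    Each expansion is wrapped in a non-capturing group so that a '|'
    inside one definition cannot leak into the pattern that uses it.
    There is no cycle detection; a self-referencing definition would
    recurse until Python raises RecursionError.
    """
    def substitute(match):
        return "(?:" + expand(match.group(1), definitions) + ")"

    return _REFERENCE.sub(substitute, definitions[name])
```

For example, re.fullmatch(expand("Email"), "dalke@dalkescientific.com")
succeeds, while re.fullmatch(expand("Hostname"), "localhost") returns None
because Hostname requires at least one dot.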
# various dates -- would be nice to have more formal
# names for these rather than making up my own
month = 0[1-9]|1[012]?|[23456789]
day = 0[1-9]|1[0-9]?|2[0-9]?|3[01]?|[456789]
year_2 = \d\d
year_4 = \d\d\d\d
year_2_or_4 = \d\d(\d\d)?
US_date = {month}/{day}/{year_2_or_4}
International_date = {day}[/.]{month}[/.]{year_2_or_4}
ISO8601_date = \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}
ARPA_date = ...
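As a rough illustration of how the date pieces combine, here is a sketch
that builds a US-style date from the month, day and year patterns. The
named groups and parse_us_date() are made up for this example, and it
assumes a slash between the day and the year.

```python
import re

MONTH = r"0[1-9]|1[012]?|[23456789]"
DAY = r"0[1-9]|1[0-9]?|2[0-9]?|3[01]?|[456789]"
YEAR_2_OR_4 = r"\d\d(\d\d)?"

# Each piece goes in its own named group, which also keeps the '|'
# alternatives in MONTH and DAY from spilling into the rest.
US_DATE = r"(?P<month>%s)/(?P<day>%s)/(?P<year>%s)" % (MONTH, DAY, YEAR_2_OR_4)


def parse_us_date(text):
    """Return (month, day, year) as ints, or None if text does not match.

    The pattern only checks the shape of the fields. It does not know
    how many days each month has, so "2/30/2001" is accepted.
    """
    m = re.fullmatch(US_DATE, text)
    if m is None:
        return None
    return int(m.group("month")), int(m.group("day")), int(m.group("year"))
```

parse_us_date("12/12/2001") gives (12, 12, 2001) and parse_us_date("7/4/76")
gives (7, 4, 76); "13/1/2001" is rejected because 13 is not a month.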
As you can see, these get hairy, fast.
Does anyone have suggestions for common, useful patterns?
Both needed patterns and existing definitions would help.
I searched the web for definitions but found nothing better
than what I listed here.
Andrew
dalke at dalkescientific.com