[Tutor] help with regular expressions

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Thu Feb 5 19:34:25 EST 2004



On Thu, 5 Feb 2004, Christopher Spears wrote:

> I'm trying to figure out regular expressions and am completely baffled!
> I understand the concept because there is something similar in UNIX, but
> for some reason, Python regular expressions don't make any sense to me!
> Are there some good tutorials that can help explain this subject to me?

Hi Chris,

Yes, there's a tutorial-style Regular Expression HOWTO by A.M. Kuchling:

    http://www.amk.ca/python/howto/regex/



Regular expressions allow us to define text patterns.  For example, we can
define a pattern of a bunch of 'a's:

###
>>> import re
>>> pattern = re.compile('a+')
>>> pattern
<_sre.SRE_Pattern object at 0x8126060>
###



'pattern' is a regular expression that recognizes every contiguous run of
one or more 'a's.  That is, if we give it a string containing 'a's, it'll
find exactly where they are.


Let's see what it does on a simple example:

###
>>> pattern.findall('this is a test')
['a']
###

Here, it found the letter 'a'.



Let's try something else:

###
>>> pattern.findall('aaabaracccaaaadaabraaaa')
['aaa', 'a', 'a', 'aaaa', 'aa', 'aaaa']
###

And here, it found all 'a' sequences in that string.
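
findall() only hands back the matched text.  If we also want to know
where each run sits in the string, one option is finditer(), which yields
match objects that carry their positions.  Here's a minimal sketch; the
helper name 'find_runs' is just made up for illustration:

```python
import re

pattern = re.compile('a+')

def find_runs(text):
    """Return (start, end, matched_text) for each run of 'a' in text.

    'find_runs' is a hypothetical helper, not part of the re module.
    """
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]

print(find_runs('this is a test'))   # [(8, 9, 'a')]
```

The 'end' value follows Python's slicing convention, so text[start:end]
gives back the matched run.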



Does this make sense so far?  The pattern above is deliberately simple,
but regular expressions can get a little more complicated.


For example, here's a regular expression that tries to detect date strings
of the form '2/5/2004':

###
>>> date_regex = re.compile('[0-9]+/[0-9]+/[0-9]+')
>>> date_regex.findall("this is a test on 02/05/2004, right?")
['02/05/2004']
###


The regular expression is trying to say "a bunch of digits, followed by a
slash, followed by another bunch of digits, followed by a slash, and
then topped with another bunch of digits".  Whew.  *grin*
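
One way to keep that English description attached to the pattern itself is
the re.VERBOSE flag, which lets us spread a pattern over several lines and
comment each piece.  This is just a sketch of the same pattern written that
way; the name 'date_regex_verbose' is made up for illustration:

```python
import re

# Same pattern as date_regex, spelled out piece by piece.
# With re.VERBOSE, whitespace and '#' comments inside the pattern are ignored.
date_regex_verbose = re.compile(r"""
    [0-9]+   # a bunch of digits
    /        # a slash
    [0-9]+   # another bunch of digits
    /        # another slash
    [0-9]+   # a final bunch of digits
""", re.VERBOSE)

print(date_regex_verbose.findall("this is a test on 02/05/2004, right?"))
# ['02/05/2004']
```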



Caveat: the pattern above is too lenient for catching date strings. It
also catches stuff like 2005/2/5, or even things like:

###
>>> date_regex.findall("looky 1/2/3 or /4/5/6/")
['1/2/3', '4/5/6']
###

So there's something of an art to writing good regular expressions:
general enough to match everything we want, yet specific enough to reject
everything we don't.
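
As a rough sketch of what "more specific" might look like, we could limit
how many digits each part may have and refuse to match when the candidate
is glued to other digits or slashes.  The names 'strict_date_regex' and
'find_dates' are made up for illustration, and I'm assuming the dates are
written month/day/year with a four-digit year:

```python
import re
from datetime import datetime

# Assumption: month/day/year, with 1-2 digits, 1-2 digits, and 4 digits.
# The lookbehind and lookahead reject candidates that touch another digit
# or slash, so fragments of longer strings like /4/5/6/ are skipped.
strict_date_regex = re.compile(
    r'(?<![0-9/])[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}(?![0-9/])')

def find_dates(text):
    """Return the candidates in text that are also real calendar dates."""
    dates = []
    for candidate in strict_date_regex.findall(text):
        try:
            # The regex only checks the shape; strptime checks the values.
            datetime.strptime(candidate, '%m/%d/%Y')
        except ValueError:
            continue
        dates.append(candidate)
    return dates

print(strict_date_regex.findall("looky 1/2/3 or /4/5/6/"))   # []
print(find_dates("on 2/5/2004 and 99/99/2004"))              # ['2/5/2004']
```

The regular expression alone would still accept something like 99/99/2004,
which is why the sketch passes each candidate through datetime.strptime()
as a second check.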



If you have questions, please feel free to ask.



