[Tutor] Making Regular Expressions readable

Stephen Nelson-Smith sanelson at gmail.com
Mon Mar 8 17:12:35 CET 2010


Hi,

I've written this today:

#!/usr/bin/env python
import re

# Adjacent raw strings are concatenated into one pattern; a piece that ends
# in a space carries the literal space that separates two log fields.
pattern = (
    r'(?P<ForwardedFor>^(-|[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}(, '
    r'[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})*){1}) '
    r'(?P<RemoteLogname>(\S*)) (?P<RemoteUser>(\S*)) '
    r'(?P<Timestamp>(\[[^\]]+\])) '
    r'(?P<FirstLineOfRequest>(\"([^"\\]*(?:\\.[^"\\]*)*)\")?) '
    r'(?P<Status>(\S*)) (?P<Size>(\S*)) '
    r'(?P<Referrer>(\"([^"\\]*(?:\\.[^"\\]*)*)\")?) '
    r'(?P<UserAgent>(\"([^"\\]*(?:\\.[^"\\]*)*)\")?)( '
    r')?(?P<SiteIntelligenceCookie>(\"([^"\\]*(?:\\.[^"\\]*)*)\")?)'
)

regex = re.compile(pattern)

lines = 0
no_cookies = 0

for line in open('/home/stephen/scratch/feb-100.txt'):
  lines +=1
  line = line.strip()
  match = regex.match(line)

  if match:
    data = match.groupdict()
    if data['SiteIntelligenceCookie'] == '':
      no_cookies +=1
  else:
    print "Couldn't match ", line

print "I analysed %s lines." % (lines,)
print "There were %s lines with missing Site Intelligence cookies." % (no_cookies,)

It works fine, but it looks pretty unreadable and unmaintainable to
anyone who hasn't spent all day writing regular expressions.

I remember reading about verbose regular expressions.  Would these help?
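
As far as I remember, the idea is that with re.VERBOSE whitespace in the
pattern is ignored and # starts a comment, so each field can sit on its
own line. Here is a cut-down sketch covering only the first four fields.
The sample line is made up for illustration, and a literal space has to
be written as [ ] (or escaped) because bare spaces are dropped:

```python
import re

# Cut-down sketch: only the first four fields, not my full pattern.
# Under re.VERBOSE, unescaped whitespace is ignored and '#' starts a comment.
verbose_regex = re.compile(r"""
    ^(?P<ForwardedFor> - | \d{1,3}(?:\.\d{1,3}){3} )   # client IP, or a dash
    [ ]                                                 # literal space
    (?P<RemoteLogname> \S* )
    [ ]
    (?P<RemoteUser> \S* )
    [ ]
    (?P<Timestamp> \[ [^\]]+ \] )                       # bracketed timestamp
""", re.VERBOSE)

# Made-up sample line, for illustration only.
sample = '10.0.0.1 - - [08/Mar/2010:17:12:35 +0100] "GET / HTTP/1.1" 200 512'
fields = verbose_regex.match(sample).groupdict()
print(fields['Timestamp'])
```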

How could I make the above more maintainable?
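
The other thing I wondered about is giving the repeated chunks names and
joining the fields together, roughly as in the sketch below. IP, QUOTED
and field() are names I've invented here, and the sample line is made
up. I've also used non-capturing groups where the original had extra
capturing ones, so the numbered groups differ even though the named ones
should come out the same:

```python
import re

# Hypothetical names for the chunks that repeat in the pattern.
IP = r'[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'
QUOTED = r'"[^"\\]*(?:\\.[^"\\]*)*"'  # double-quoted string, backslash escapes allowed


def field(name, body):
    """Wrap a sub-pattern in a named group."""
    return '(?P<%s>%s)' % (name, body)


optional_quoted = '(?:%s)?' % QUOTED

parts = [
    field('ForwardedFor', r'-|%s(?:, %s)*' % (IP, IP)),
    field('RemoteLogname', r'\S*'),
    field('RemoteUser', r'\S*'),
    field('Timestamp', r'\[[^\]]+\]'),
    field('FirstLineOfRequest', optional_quoted),
    field('Status', r'\S*'),
    field('Size', r'\S*'),
    field('Referrer', optional_quoted),
    field('UserAgent', optional_quoted),
]

# Fields are separated by single spaces; the space before the cookie is optional.
composed = '^' + ' '.join(parts) + ' ?' + field('SiteIntelligenceCookie', optional_quoted)
composed_regex = re.compile(composed)

# Made-up sample line with no cookie field, for illustration only.
sample_line = ('10.0.0.1, 10.0.0.2 - - [08/Mar/2010:17:12:35 +0100] '
               '"GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"')
data = composed_regex.match(sample_line).groupdict()
```

With that sample, data['SiteIntelligenceCookie'] is the empty string,
which is the case my script counts as a missing cookie.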

S.

-- 
Stephen Nelson-Smith
Technical Director
Atalanta Systems Ltd
www.atalanta-systems.com

