Regex error in python (weird?)

Andrew Dalke dalke at acm.org
Tue Aug 29 23:26:26 EDT 2000


Aleksandar Alimpijevic asked about regular expressions:
>My regular expression is:
>   CHKformat_rex = re.compile('(?P<IP>\d{1,3}(\.\d{1,3}){3,3}) \S+ \S+ '
>'(?P<date>\[\d{2,2}/[a-zA-z]{3,3}/\d{4,4}(:\d{2,2}){3,3} [+-]\d{4,4}\])'
>                    '"(?P<request>GET|HEAD|POST) '
>                    '(?P<req_fname>/(\S+/?)*) HTTP/\d\.\d{1,2}"
>(?P<reply_code>\d{3,3}) (?P<reply_size>\d+|-)')

(That probably came out wrong what with line wrapping.)

Here are some things to try:

Replace {2,2} with {2} and {3,3} with {3} and ... That makes your
regex easier to understand, though it won't change anything.
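
A quick way to convince yourself the two spellings are equivalent is to
run both on a few strings.  This is only a sketch: it uses just the IP
part of your pattern, and the helper name and sample strings are made up
for the illustration.

```python
import re

# The IP part of the pattern, written both ways.
old_style = re.compile(r'\d{1,3}(\.\d{1,3}){3,3}')
new_style = re.compile(r'\d{1,3}(\.\d{1,3}){3}')

def matched_text(pattern, text):
    """Return the text matched at the start of `text`, or None."""
    m = pattern.match(text)
    if m is None:
        return None
    return m.group()

samples = ['201.120.68.38', '1.2.3', '10.0.0.1.5', 'abc']
for text in samples:
    # Both spellings should agree on every sample.
    print(text, matched_text(old_style, text) == matched_text(new_style, text))
```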

To get the filename you are using '/(\S+/?)*'.  Instead, try '/\S*'.
'/' is a \S character, so \S+ will greedily read up to the space then
backtrack on errors; there would be n**2 backtracks where n is the
number of '/'s in your filename.  But your example only has one, so
that's not the problem.  Still, change the code anyway.
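
For what it's worth, here is a small sketch of the simpler filename
pattern on the request part of a log line.  The two request strings are
invented for the example (the second is deliberately malformed), and
only the request part of the full pattern is used.

```python
import re

# Simplified filename group: a '/' followed by any run of non-space chars.
request_rex = re.compile(r'(?P<req_fname>/\S*) HTTP/\d\.\d{1,2}')

good = 'HEAD /index.html HTTP/1.0'
broken = 'HEAD /index.html HaTTP/1.0'   # malformed on purpose

m = request_rex.search(good)
print(m.group('req_fname'))

# The malformed request simply fails, with no nested group to re-split.
bad = request_rex.search(broken)
print(bad)
```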

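Um, actually it's worse than that.  The '/?' is optional, so
'(\S+/?)*' is free to split the filename into pieces any way it
likes.  On any error the number of backtracks grows exponentially with
n, the number of characters in the filename (there are about 2**(n-1)
ways to split n characters), not merely as n**2.  It basically allows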
  '/index.html'
  '/index.htm' + no '/' + 'l' + no '/'
  '/index.ht' + no '/' + 'm' + no '/' + 'l' + no '/'
  '/index.ht' + no '/' + 'ml' + no '/'
   ...

Each of those is a successful parse, but when a later part of the
pattern (such as the HTTP match) fails, it triggers another backtrack,
which fails, causing another backtrack, which fails, causing ... a
long wait for it to finally fail.
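
To get a feel for how many such splits there are, here is a rough
sketch that simply enumerates every way of cutting a name into
consecutive non-empty pieces.  It does not run the regex engine at all;
the function name is made up, and the count is only an indication of
how many alternatives a backtracking matcher may have to try before
giving up.

```python
def splits(name):
    """Yield every way to cut `name` into consecutive non-empty pieces."""
    if not name:
        # Nothing left to cut: exactly one (empty) way.
        yield []
        return
    for i in range(1, len(name) + 1):
        # Take the first i characters as one piece, then split the rest.
        for rest in splits(name[i:]):
            yield [name[:i]] + rest

ways = list(splits('index.html'))
print(len(ways))   # 512, i.e. 2**(10 - 1) for the 10 characters
```

Each extra character in the filename doubles that count.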

You aren't actually doing validation on the data line.  (E.g., you
allow an IP address like 999.987.456.732, whose octets are all out of
range and which is quite illegal.)  So you might instead use a set of
string operations like (untested):

# The log line must be a single line; it is split here only to keep
# the source narrow.
s = ('201.120.68.38 - - [05/Jun/2000:16:30:29 +1000] '
     '"HEAD /index.html HTTP/1.0" 304 -')

# Split on whitespace using the string method (no 'string' module needed)
fields = s.split()

ip_addr = fields[0]
date = fields[3][1:12]   # '05/Jun/2000'
time = fields[3][13:]    # '16:30:29'
tz = fields[4][:-1]      # '+1000', dropping the trailing ']'

# Look for the quotes to get the filename (between the first space after
# the opening quote and the last space before the closing quote)
q1 = s.find('"')
q2 = s.find('"', q1+1)
filename = s[s.find(" ", q1)+1:s.rfind(" ", q1, q2)]

reply_code = fields[-2]
reply_size = fields[-1]

Maybe not as terse to read, and maybe not even as fast as an re,
but a lot easier to debug.
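
If you did want to validate the address once it has been pulled out,
something along these lines might do.  It is only a sketch: the
function name is made up, and it assumes plain dotted-quad IPv4
addresses with each octet in the range 0 to 255.

```python
def is_valid_ipv4(text):
    """Return True if `text` looks like a dotted-quad IPv4 address."""
    parts = text.split('.')
    if len(parts) != 4:
        return False
    for part in parts:
        # Each part must be a non-empty run of ASCII digits ...
        if not part or not all(c in '0123456789' for c in part):
            return False
        # ... and must fit in one octet.
        if int(part) > 255:
            return False
    return True

print(is_valid_ipv4('201.120.68.38'))
print(is_valid_ipv4('999.987.456.732'))
```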

                    Andrew
                    dalke at acm.org





