Re for Apache log file format

Piet van Oostrum piet at vanoostrum.org
Wed Oct 9 13:33:14 EDT 2013


Sam Giraffe <sam at giraffetech.biz> writes:

> Hi,
>
> I am trying to split up the re pattern for Apache log file format and seem to be having some
> trouble in getting Python to understand multi-line pattern:
>
> #!/usr/bin/python
>
> import re
>
> #this is a single line
> string = '192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0" 302 276 "-" "check_http/
> v1.4.16 (nagios-plugins 1.4.16)"'
>
> #trying to break up the pattern match for easy to read code
> pattern = re.compile(r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+'
>                      r'(?P<ident>\-)\s+'
>                      r'(?P<username>\-)\s+'
>                      r'(?P<TZ>\[(.*?)\])\s+'
>                      r'(?P<url>\"(.*?)\")\s+'
>                      r'(?P<httpcode>\d{3})\s+'
>                      r'(?P<size>\d+)\s+'
>                      r'(?P<referrer>\"\")\s+'
>                      r'(?P<agent>\((.*?)\))')
>
> match = re.search(pattern, string)
>
> if match:
>     print match.group('ip')
> else:
>     print 'not found'
>
> The python interpreter is skipping to the 'math = re.search' and then the 'if' statement right
> after it looks at the <ip>, instead of moving onto <ident> and so on.

Although you have written the regexp as a sequence of lines, in reality it is a single string, and therefore pdb will do only a single step, and not go into its "parts", which really are not parts.
>
> mybox:~ user$ python -m pdb /Users/user/Documents/Python/apache.py
>> /Users/user/Documents/Python/apache.py(3)<module>()
> -> import re
> (Pdb) n
>> /Users/user/Documents/Python/apache.py(5)<module>()
> -> string = '192.168.122.3 - - [29/Sep/2013:03:52:33 -0700] "GET / HTTP/1.0" 302 276 "-"
> "check_http/v1.4.16 (nagios-plugins 1.4.16)"'
> (Pdb) n
>> /Users/user/Documents/Python/apache.py(7)<module>()
> -> pattern = re.compile(r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+'
> (Pdb) n
>> /Users/user/Documents/Python/apache.py(17)<module>()
> -> match = re.search(pattern, string)
> (Pdb)

Also as Andreas has noted the r'(?P<referrer>\"\")\s+' part is wrong. It should probably be 
r'(?P<referrer>\".*?\")\s+'

And the r'(?P<agent>\((.*?)\))') will also not match as there is text outside the (). Should probably also be
r'(?P<agent>\".*?\")') or something like it.
-- 
Piet van Oostrum <piet at vanoostrum.org>
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]



More information about the Python-list mailing list