In defence of 80-char lines
Neil Cerutti
neilc at norwich.edu
Thu Apr 4 11:56:56 EDT 2013
On 2013-04-04, Roy Smith <roy at panix.com> wrote:
> re.X is a pretty cool tool for making huge regexes readable.
> But, it turns out that python's auto-continuation and string
> literal concatenation rules are enough to let you get much the
> same effect. Here's a regex we use to parse haproxy log files.
> This would be utter line noise all run together. This way, it's
> almost readable :-)
>
> pattern = re.compile(r'haproxy\[(?P<pid>\d+)]: '
>                      r'(?P<client_ip>(\d{1,3}\.){3}\d{1,3}):'
>                      r'(?P<client_port>\d{1,5}) '
>                      r'\[(?P<accept_date>\d{2}/\w{3}/\d{4}(:\d{2}){3}\.\d{3})] '
>                      r'(?P<frontend_name>\S+) '
>                      r'(?P<backend_name>\S+)/'
>                      r'(?P<server_name>\S+) '
>                      r'(?P<Tq>(-1|\d+))/'
>                      r'(?P<Tw>(-1|\d+))/'
>                      r'(?P<Tc>(-1|\d+))/'
>                      r'(?P<Tr>(-1|\d+))/'
>                      r'(?P<Tt>\+?\d+) '
>                      r'(?P<status_code>\d{3}) '
>                      r'(?P<bytes_read>\d+) '
>                      r'(?P<captured_request_cookie>\S+) '
>                      r'(?P<captured_response_cookie>\S+) '
>                      r'(?P<termination_state>[\w-]{4}) '
>                      r'(?P<actconn>\d+)/'
>                      r'(?P<feconn>\d+)/'
>                      r'(?P<beconn>\d+)/'
>                      r'(?P<srv_conn>\d+)/'
>                      r'(?P<retries>\d+) '
>                      r'(?P<srv_queue>\d+)/'
>                      r'(?P<backend_queue>\d+) '
>                      r'(\{(?P<request_id>.*?)\} )?'
>                      r'(\{(?P<captured_request_headers>.*?)\} )?'
>                      r'(\{(?P<captured_response_headers>.*?)\} )?'
>                      r'"(?P<http_request>.+)"'
>                      )
>
> And, for those of you who go running in the other direction every time
> regex is suggested as a solution, I challenge you to come up with easier
> to read (or write) code for parsing a line like this (probably
> hopelessly mangled by the time you read it):
>
> 2013-04-03T00:00:00+00:00 localhost haproxy[5199]: 10.159.19.244:57291
> [02/Apr/2013:23:59:59.811] app-nodes next-song-nodes/web8.songza.com
> 0/0/3/214/219 200 593 sessionid=NWiX5KGOdvg6dSaA
> sessionid=NWiX5KGOdvg6dSaA ---- 249/249/149/14/0 0/0
> {4C0ABFA9-515B6DEF-933229} "POST
> /api/1/station/892337/song/16024201/notify-play HTTP/1.0"
The big win from the above seems to me to be the groupdict
result. The parsing is also very simple, with virtually no
nesting. It's a good application of re.
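For reference, the groupdict payoff looks like this with a
trimmed-down version of Roy's pattern (just the first three named
groups, matched against the start of his sample line):

```python
import re

# Only the leading named groups of the full haproxy pattern; one
# call to groupdict() returns a dict keyed by the (?P<name>...) names.
pattern = re.compile(r'haproxy\[(?P<pid>\d+)\]: '
                     r'(?P<client_ip>(\d{1,3}\.){3}\d{1,3}):'
                     r'(?P<client_port>\d{1,5})')

m = pattern.search('localhost haproxy[5199]: 10.159.19.244:57291')
print(m.groupdict())
# {'pid': '5199', 'client_ip': '10.159.19.244', 'client_port': '57291'}
```

Note that every value comes back as a string; converting pids and
ports to ints is still up to the caller.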
It seems easy enough to do with str methods, but would it be an
improvement?
I ran out of time before the prototype was finished, but here's a
sketch.
import datetime
import pprint

s = ('2013-04-03T00:00:00+00:00 localhost haproxy[5199]: 10.159.19.244:57291'
     ' [02/Apr/2013:23:59:59.811] app-nodes next-song-nodes/web8.songza.com'
     ' 0/0/3/214/219 200 593 sessionid=NWiX5KGOdvg6dSaA'
     ' sessionid=NWiX5KGOdvg6dSaA ---- 249/249/149/14/0 0/0'
     ' {4C0ABFA9-515B6DEF-933229}'
     ' "POST /api/1/station/892337/song/16024201/notify-play HTTP/1.0"')
def get_haproxy(s):
    prefix = 'haproxy['
    if s.startswith(prefix):
        return int(s[len(prefix):s.index(']')])
    return False

def get_client_info(s):
    ip, colon, port = s.partition(':')
    if colon != ':':
        return False
    else:
        return ip, int(port)

def get_accept_date(s):
    try:
        return datetime.datetime.strptime(s, '[%d/%b/%Y:%H:%M:%S.%f]')
    except ValueError:
        return False

def get_backend(s):
    name, slash, server = s.partition('/')
    if slash != '/':
        return False
    else:
        return name, server

def get_track_info(s):
    # int() raises ValueError on a malformed field; str.split never
    # raises TypeError, so catch the right exception and return
    # numbers rather than strings.
    try:
        return [int(x) for x in s.split('/')]
    except ValueError:
        return False
matchers = [
    (None, None),
    (None, 'localhost'),
    ('haproxy', get_haproxy),
    (('client_ip', 'client_port'), get_client_info),
    ('accept_date', get_accept_date),
    ('frontend_name', lambda s: s),
    (('backend_name', 'server_name'), get_backend),
    (('Tq', 'Tw', 'Tc', 'Tr', 'Tt'), get_track_info),
]
result = {}
for i, field in enumerate(s.split()):
    if i < len(matchers):  # I'm not finished writing matchers yet.
        key, matcher = matchers[i]
        if matcher is None:
            continue
        if isinstance(matcher, str):
            value = matcher == field
        else:
            value = matcher(field)
        if value is False:
            raise ValueError('Parse error {}: {} "{}"'.format(
                key, matcher, field))
        if isinstance(key, tuple):
            result.update(zip(key, value))
        elif key is not None:
            result[key] = value
pprint.pprint(result)
The engine would need a better implementation and more
flexibility once it's working and tested. I think the error
handling is a good feature, and the ability to customize parsing
and return custom types is cool.
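To make the custom-types point concrete, here is the accept-date
helper from the sketch on its own: unlike the regex, which can
only ever capture strings, it hands back a real datetime (the
sample timestamp is taken from Roy's log line):

```python
import datetime

def get_accept_date(s):
    # Same helper as in the sketch above: parse a field like
    # '[02/Apr/2013:23:59:59.811]' into a datetime, or signal
    # failure with False so the engine can report a parse error.
    try:
        return datetime.datetime.strptime(s, '[%d/%b/%Y:%H:%M:%S.%f]')
    except ValueError:
        return False

print(get_accept_date('[02/Apr/2013:23:59:59.811]'))
# 2013-04-02 23:59:59.811000
print(get_accept_date('not-a-timestamp'))
# False
```

With the regex version you would get the raw string
'02/Apr/2013:23:59:59.811' and still have to run strptime on it
afterwards.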
--
Neil Cerutti