[Tutor] Python Regex re.search() to parse system logs
Cameron Simpson
cs at cskk.id.au
Tue Dec 22 01:06:06 EST 2020
Comments below...
On 21Dec2020 17:48, Mike Wilbur <wilbur6453 at gmail.com> wrote:
>print(show_time_of_pid("Jul 6 14:01:23 computer.name CRON[29440]: USER
>(good_user)")) # call to run function with parameter
># Desired output per below:
># Jul 6 14:01:23 pid:29440
>
>My code so far keeps pulling in the string "computer.name CRON[". I can
>get the date & time OR the pid #. But not by themselves. I have not
>looked at adding the "pid:" to the output yet.
>
>*My code:*
>print(re.search("(^[\w \:]{15}.*[^a-z\.CRON][0-9]{5})", "Jul 6 14:01:23
>computer.name CRON[29440]: USER (good_user)"))
A recommendation: use "raw strings" when writing regexps:
r"(^[\w \:]{15}.*[^a-z\.CRON][0-9]{5})"
or:
r'(^[\w \:]{15}.*[^a-z\.CRON][0-9]{5})'
That leading "r" marks this as a "raw string", which means in particular
that backslashes are not special to Python. Since regexps use
backslashes to represent character classes and other things, using a raw
string prevents Python's own backslash stuff from getting in the way.
You will have fewer accidents this way.
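A small sketch of the difference, using \b as the example. The word "cron" and the sample text are made up for illustration only:

```python
import re

# In a normal string, Python turns "\b" into a single backspace character
# before the re module ever sees the pattern.
plain = "\bcron\b"
# In a raw string the backslash is kept, so re sees \b (a word boundary).
raw = r"\bcron\b"

plain_len = len(plain)  # backspace, c, r, o, n, backspace
raw_len = len(raw)      # both backslashes survive

text = "run cron now"
plain_match = re.search(plain, text)  # looks for literal backspace characters
raw_match = re.search(raw, text)      # looks for "cron" as a whole word

print(plain_len, raw_len)
print(plain_match)
print(raw_match.group())
```

The two strings look the same in the source code, but only the raw one reaches the regexp engine with its backslashes intact.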
><re.Match object; span=(0, 39), match='Jul 6 14:01:23 computer.name
>CRON[29440'>
>
>Produced code using group names that isolates desired output. But this
>will not work with re.search() I believe.
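It works just fine with re.search. re.match and re.search both return
"Match" objects; they differ only in where they look. re.match matches
only at the start of the text, while re.search scans forward for the
first place where the pattern matches.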
>I think I'd need to use re.sub() instead.
No need.
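To illustrate how re.match and re.search treat the same named-group pattern, here is a minimal sketch. The group name "pid" and the pattern are my own choice for the demonstration, not your code:

```python
import re

line = "Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"

# A pattern with one named group: digits inside square brackets.
pid_ptn = re.compile(r"\[(?P<pid>\d+)\]")

at_start = pid_ptn.match(line)   # only tries at the very start of the line
anywhere = pid_ptn.search(line)  # scans along the line for the first match

print(at_start)               # the line does not begin with "["
print(anywhere.group("pid"))  # the named group works the same either way
```

Both calls hand back the same kind of object when they succeed, so named groups are available from either one.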
Let's look at your regexp (ignoring the quotes - they're just for
Python):
(^[\w \:]{15}.*[^a-z\.CRON][0-9]{5})
Brackets () in a regexp group a section as a single unit. You don't need
brackets around the whole thing.
^[\w \:]{15}.*[^a-z\.CRON][0-9]{5}
Let's look at each part:
    ^             Start of string.
    [\w \:]       A single character which is a "word" character or a
                  space or a colon.
    {15}          Exactly 15 such characters.
    .*            Any number of characters (zero or more of '.', which
                  is any single character).
    [^a-z\.CRON]  A single character which is not one of a-z, ., C, R,
                  O, N.
    [0-9]         A digit. Which can also be written \d
    {5}           Exactly 5 such characters, so exactly 5 digits.
I think your "CRON" above should be _outside_ the [] character class.
Inside the brackets it means "any one of C, R, O or N", not the word
CRON.
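You can see what each part grabbed by putting brackets around the pieces. This is the same pattern with extra groups added purely for inspection; the variable names are mine:

```python
import re

line = "Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"

# The original pattern, with brackets around each part so that we can
# ask the match object what each part matched.
ptn = re.compile(r"^([\w \:]{15})(.*)([^a-z\.CRON])([0-9]{5})")
m = ptn.search(line)

head, middle, not_in_class, digits = m.groups()
print(repr(head))          # the first 15 characters
print(repr(middle))        # whatever .* ended up with
print(repr(not_in_class))  # the single character matched by [^a-z\.CRON]
print(repr(digits))        # the 5 digits
```

The character class consumes just one character, here the "[", which is why "computer.name CRON[" ends up inside your match.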
I recommend starting with a sample input line and deciding how to match
each piece alone. You often have a choice here - take the simplest
choice available.
So:
Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)
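Your "[\w :]{15}" looks good, as long as the date really is 15 characters.
Dates in these logs are normally a fixed length, with a one-digit day
padded by an extra space. If your line has only a single space there, the
date is 14 characters and {15} will also take the space after the time.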
After that I tend to be less trusting of fixed widths. So I'd match the
spaces with \s+, meaning "1 or more whitespace characters".
A computer name may have several characters, but won't have whitespace.
You know where it will be, so just match \S+, meaning "1 or more
nonspace characters".
"CRON" seems critical to you. You can match it literally just by writing
CRON.
Alternatively, you might want any service, not just cron, so you could
match a word ending in digits in brackets. Eg \S+\[\d+\] meaning "one or
more nonspace characters followed by a left square bracket followed by 1
or more digits followed by a right square bracket".
And so on.
You do not need to match the entire line. Just stop!
This lets you build up your regular expression progressively. Match the
first thing. When that's good, add some more pattern and test again.
Continue until you have matched everything that you need.
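Here is a rough sketch of that progression. One assumption of mine: I spell the timestamp out as month, day and time instead of counting characters, so it does not matter whether the day is padded with one space or two. The names step1, step2 and step3 are just for illustration:

```python
import re

line = "Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"

# Step 1: just the timestamp (month name, day number, hh:mm:ss).
step1 = r"^\w+\s+\d+\s+\d\d:\d\d:\d\d"
# Step 2: add the whitespace and the computer name.
step2 = step1 + r"\s+\S+"
# Step 3: add the whitespace, a service name, and digits in square brackets.
step3 = step2 + r"\s+\S+\[\d+\]"

matched = []
for ptn in (step1, step2, step3):
    m = re.search(ptn, line)
    matched.append(m.group() if m else None)
    print(repr(m.group()) if m else "NO MATCH")
```

Each step matches a little more of the line than the one before, and none of them needs to reach the end of the line.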
Your plan to use named sections is good: surround the important pieces in
    (?P<name>
and
    )
Then the match object will have these named pieces for use by name
later. See the Match.groupdict method. Example:
    import re

    ptn = re.compile(r'your regexp in here')
    m = ptn.match(your_input_line_here)
    if not m:
        print("NO MATCH")
    else:
        matches = m.groupdict()
        # print the timestamp part of your match
        print(matches['timestamp'])
So start slowly: write a regexp, with named parts, that just matches the
first thing. And print it by name as above. Then extend the expression
one part at a time until everything matches.
That way you only need to consider problems with the small thing you
have added.
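Putting the pieces together, one possible version of your function might look like the sketch below. It is only one way to do it: the group names "timestamp" and "pid", the name LOG_PTN, and the choice to return None when nothing matches are all my assumptions, and the timestamp is again spelled out rather than counted.

```python
import re

# Hypothetical pattern; built from the pieces discussed above.
LOG_PTN = re.compile(
    r"^(?P<timestamp>\w+\s+\d+\s+\d\d:\d\d:\d\d)"  # e.g. "Jul 6 14:01:23"
    r"\s+\S+"                                      # the computer name
    r"\s+\S+\[(?P<pid>\d+)\]"                      # service name, then [pid]
)

def show_time_of_pid(line):
    """Return "<timestamp> pid:<pid>" for a log line, or None if no match."""
    m = LOG_PTN.search(line)
    if not m:
        return None
    matches = m.groupdict()
    return "{} pid:{}".format(matches["timestamp"], matches["pid"])

print(show_time_of_pid(
    "Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"
))
```

Because the pieces you care about are named, adding the "pid:" text is just string formatting after the match, with no need for re.sub.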
Finally, note that most regexp repetitions are "greedy". So .* will match
zero or more characters, but as many as possible.
You might be concerned that that would match the entire line of text.
Well it would, _except_ that it will only match stuff as long as the
rest of the pattern _also_ matches. So if using the whole line prevents
the rest of the pattern matching, it backs off a character and tries
again, making it shorter and shorter until the rest of the pattern does
match.
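A short demonstration of that backing off, on a cut-down piece of your line (the variable names are mine):

```python
import re

text = "CRON[29440]"

# On its own, .* takes everything.
whole = re.search(r".*", text).group()

# With more pattern after it, .* gives characters back until the rest
# can match. Here it must leave "[29440]" for the remaining pattern.
m = re.search(r"(.*)\[(\d+)\]", text)
before, pid = m.groups()

# It gives back only as much as it has to: the ".*" below keeps all the
# digits it can and frees just one for \d+.
last = re.search(r".*(\d+)", text).group(1)

print(repr(whole), repr(before), repr(pid), repr(last))
```

The last case is the usual surprise with greedy matching: the pattern succeeds, but the group holds a single digit because .* kept the rest.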
Cheers,
Cameron Simpson <cs at cskk.id.au>