[Tutor] Python Regex re.search() to parse system logs

Tue Dec 22 01:06:06 EST 2020

Comments below...

On 21Dec2020 17:48, Mike Wilbur <wilbur6453 at gmail.com> wrote:
>print(show_time_of_pid("Jul 6 14:01:23 computer.name CRON[29440]: USER
>(good_user)")) # call to run function with parameter
># Desired output per below:
># Jul 6 14:01:23 pid:29440
>
>My code so far keeps pulling in the string "computer.name CRON[".  I can
>get the date & time OR the pid #.  But not by themselves.  I have not
>looked at adding the "pid:" to the output yet.
>
>*My code:*
>print(re.search("(^[\w \:]{15}.*[^a-z\.CRON][0-9]{5})", "Jul 6 14:01:23
>computer.name CRON[29440]: USER (good_user)"))

A recommendation: use "raw strings" when writing regexps:

   r"(^[\w \:]{15}.*[^a-z\.CRON][0-9]{5})"

or:

   r'(^[\w \:]{15}.*[^a-z\.CRON][0-9]{5})'

That leading "r" marks this as a "raw string", which means in particular 
that backslashes are not special to Python. Since regexps use 
backslashes to represent character classes and other things, using a raw 
string prevents Python's own backslash handling from getting in the way.  
You will have fewer accidents this way.
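
A small sketch of the kind of accident a raw string avoids. The pattern 
and sample text here are made up for illustration; \b is chosen because 
it means "backspace" to Python but "word boundary" to the regexp engine:

```python
import re

plain = "\bcat\b"    # Python turns each \b into one backspace character
raw = r"\bcat\b"     # backslashes kept; the regexp engine sees \b (word boundary)

print(len(plain), len(raw))              # 5 7

plain_match = re.search(plain, "the cat sat")
raw_match = re.search(raw, "the cat sat")
print(plain_match)                       # None: the text has no backspaces
print(raw_match.group())                 # cat
```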

><re.Match object; span=(0, 39), match='Jul 6 14:01:23 computer.name
>CRON[29440'>
>
>Produced code using group names that isolates desired output.  But this
>will not work with re.search() I believe.

It works just fine with re.search. re.match and re.search both return 
"Match" objects; they differ only in where a match may start. re.match 
only tries at the beginning of the text, while re.search tries each 
position in turn.

>I think I'd need to use re.sub() instead.

No need.
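
To illustrate that named groups behave the same with re.match and 
re.search, here is a minimal sketch. The pattern and the group name 
"pid" are my own choices for the example, not something from your code:

```python
import re

line = "Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"
pid_ptn = re.compile(r"\[(?P<pid>\d+)\]")

m_match = pid_ptn.match(line)     # only tries at position 0, where there is no "["
m_search = pid_ptn.search(line)   # scans forward until the pattern fits

print(m_match)                    # None
print(m_search.group("pid"))      # 29440
```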

Let's look at your regexp (ignoring the quotes - they're just for 
Python):

   (^[\w \:]{15}.*[^a-z\.CRON][0-9]{5})

Brackets () in a regexp group a section as a single unit. You don't need 
brackets around the whole thing.

   ^[\w \:]{15}.*[^a-z\.CRON][0-9]{5}

Let's look at each part:

^           Start of string.

[\w \:]     A single character which is a "word" character or a space or
            a colon.

{15}        Exactly 15 such characters.

.*          Any number of characters (zero or more of '.', which is any
            single character).

[^a-z\.CRON] A single character which is not one of a-z, ., C, R, O, N.

[0-9]       A digit, which can also be written \d

{5}         Exactly 5 such characters, so exactly 5 digits.

I think your "CRON" above should be _outside_ the [] character class.  
Inside the brackets it is just four separate characters, not the word 
CRON.
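
The breakdown above can be checked directly. This sketch runs the 
original pattern (without the outer brackets) against the sample line, 
then shows that the order of characters inside [] makes no difference, 
and finally matches CRON literally; the last pattern is only one 
possible way to write that:

```python
import re

line = "Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"

# The negated class consumes just the "[" before the digits, and .*
# swallows everything between the date and that "[".
original = re.search(r"^[\w \:]{15}.*[^a-z\.CRON][0-9]{5}", line)
print(original.group())   # Jul 6 14:01:23 computer.name CRON[29440

# A class is a set of single characters, so shuffling it changes nothing.
same_class = re.search(r"^[\w \:]{15}.*[^NORC\.a-z][0-9]{5}", line)
print(same_class.group() == original.group())   # True

# With CRON outside the class it is matched as a literal word.
literal = re.search(r"CRON\[([0-9]{5})\]", line)
print(literal.group(1))   # 29440
```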

I recommend starting with a sample input line and deciding how to match 
each piece alone. You often have a choice here - take the simplest 
choice available.

So:

    Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)

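Your "[\w :]{15}" looks good. Dates in these logs are a fixed length (15 
characters in a real syslog file, where a single digit day is padded 
with an extra space) and this will be reliable. Note that the sample as 
pasted above has only 14 characters of date, so there the 15th character 
matched is the space after the time.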

After that I tend to be less trusting of the layout. So I'd match the 
spaces with \s+, meaning "1 or more whitespace characters".

A computer name may contain various characters, but won't contain 
whitespace. You know where it will be, so just match \S+, meaning "1 or 
more nonspace characters".

"CRON" seems critical to you. You can match it literally just by writing 
CRON.

Alternatively, you might want any service, not just cron, so you could 
match a word ending in digits in brackets. Eg \S+\[\d+\] meaning "one or 
more nonspace characters followed by a left square bracket followed by 1 
or more digits followed by a right square bracket".
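
As a quick sketch of that last pattern: the second log line below is 
invented for illustration (an sshd entry is an assumption, it is not in 
your data):

```python
import re

service_ptn = re.compile(r"\S+\[\d+\]")

lines = [
    "Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)",
    "Jul 6 14:05:01 computer.name sshd[1234]: made-up message",  # hypothetical
]

# search() finds the first place where "nonspaces[digits]" fits.
found = [service_ptn.search(line).group() for line in lines]
print(found)   # ['CRON[29440]', 'sshd[1234]']
```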

And so on.

You do not need to match the entire line. Just stop!

This lets you build up your regular expression progressively. Match the 
first thing. When that's good, add some more pattern and test again.  
Continue until you have matched everything that you need.
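
A sketch of that progressive approach. The patterns are one possible 
choice of mine, deliberately loose about how many spaces separate the 
fields; each entry adds a single piece to the one before it:

```python
import re

line = "Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"

steps = [
    r"^\w+\s+\d+\s+[\d:]+",                      # date and time
    r"^\w+\s+\d+\s+[\d:]+\s+\S+",                # ... then the host name
    r"^\w+\s+\d+\s+[\d:]+\s+\S+\s+\w+\[\d+\]",   # ... then service[pid]
]

results = []
for ptn in steps:
    m = re.search(ptn, line)
    results.append(m.group() if m else None)
    print(repr(results[-1]), "<-", ptn)
```

If a step prints None, the problem is in the piece you just added.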

Your plan to use named sections is good: surround the important pieces in

    (?P<name>

and

    )

Then the match object will have these named pieces for use by name 
later. See the Match.groupdict method. Example:

    import re

    ptn = re.compile(r'your regexp in here')
    m = ptn.match(your_input_line_here)
    if not m:
        print("NO MATCH")
    else:
        matches = m.groupdict()
        # print the timestamp part of your match
        print(matches['timestamp'])

So start slowly: write a regexp, with named parts, that just matches the 
first thing. And print it by name as above. Then extend the expression 
one part at a time until everything matches.

That way you only need to consider problems with the small thing you 
have added.
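
Putting the pieces together, here is one way it might end up. Treat it 
as a sketch under assumptions: the pattern and the group names 
"timestamp" and "pid" are my own choices, and returning None for a line 
that does not match is just one option:

```python
import re

LOG_PTN = re.compile(
    r"^(?P<timestamp>\w+\s+\d+\s+[\d:]+)"   # e.g. "Jul 6 14:01:23"
    r"\s+\S+"                               # host name, not captured
    r"\s+\w+\[(?P<pid>\d+)\]"               # service name and [pid]
)

def show_time_of_pid(line):
    """Return "<timestamp> pid:<pid>" for a log line, or None if no match."""
    m = LOG_PTN.search(line)
    if not m:
        return None
    return "{timestamp} pid:{pid}".format(**m.groupdict())

print(show_time_of_pid(
    "Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"
))   # Jul 6 14:01:23 pid:29440
```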

Finally, note that most regexp quantifiers are "greedy". So .* will 
match zero or more characters, but as many as possible.

You might be concerned that that would match the entire line of text.  
Well it would, _except_ that it will only match stuff as long as the 
rest of the pattern _also_ matches. So if using the whole line prevents 
the rest of the pattern from matching, it backs off a character and 
tries again, making its match shorter and shorter until the rest of the 
pattern does match.
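
A short sketch of that backing off, using two throwaway patterns of my 
own against the sample line:

```python
import re

line = "Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"

# .* first grabs the whole line, then gives characters back until "[",
# the digits and "]" can all match after it.
m1 = re.search(r"^(.*)\[(\d+)\]", line)
print(repr(m1.group(1)))   # 'Jul 6 14:01:23 computer.name CRON'
print(m1.group(2))         # 29440

# Greedy also means .* gives back as little as it can: \d+ only needs
# one digit, so .* keeps the rest of the pid for itself.
m2 = re.search(r"^(.*)(\d+)", line)
print(repr(m2.group(1)))   # 'Jul 6 14:01:23 computer.name CRON[2944'
print(m2.group(2))         # 0
```

The second result is why it pays to put something definite, like the 
literal "[", in front of the part you want to capture.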

Cheers,
Cameron Simpson <cs at cskk.id.au>