[Tutor] Python Regex re.search() to parse system logs

Mike Wilbur wilbur6453 at gmail.com
Fri Dec 25 10:36:10 EST 2020


Cameron,
Thank you for the help.  I've read through it and I'm understanding the
logic!  I can't believe I wasn't using "r" for raw strings.  I'll let you
know how it all goes but I'm confident.
Again, thanks
Mike

On Tue, Dec 22, 2020 at 1:06 AM Cameron Simpson <cs at cskk.id.au> wrote:

> Comments below...
>
> On 21Dec2020 17:48, Mike Wilbur <wilbur6453 at gmail.com> wrote:
> >print(show_time_of_pid("Jul 6 14:01:23 computer.name CRON[29440]: USER
> >(good_user)")) # call to run function with parameter
> ># Desired output per below:
> ># Jul 6 14:01:23 pid:29440
> >
> >My code so far keeps pulling in the string "computer.name CRON[".  I can
> >get the date & time OR the pid #.  But not by themselves.  I have not
> >looked at adding the "pid:" to the output yet.
> >
> >*My code:*
> >print(re.search("(^[\w \:]{15}.*[^a-z\.CRON][0-9]{5})", "Jul 6 14:01:23
> >computer.name CRON[29440]: USER (good_user)"))
>
> A recommendation: use "raw strings" when writing regexps:
>
>    r"(^[\w \:]{15}.*[^a-z\.CRON][0-9]{5})"
>
> or:
>
>    r'(^[\w \:]{15}.*[^a-z\.CRON][0-9]{5})'
>
> That leading "r" marks this as a "raw string", which means in particular
> that backslashes are not special to Python. Since regexps use
> backslashes to represent character classes and other things, using a raw
> string prevents Python's own backslash stuff from getting in the way.
> You will have fewer accidents this way.
>
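
A small sketch of what the raw string buys you. The pattern and the
sample text here are made up for illustration; they are not from the
log line above. In a normal string Python turns \b into a backspace
character before the regexp engine ever sees it, while the raw string
passes the backslash through so the regexp sees a word boundary.

```python
import re

# In a normal string, Python itself turns "\b" into a backspace character.
plain = "\bcron\b"
# In a raw string the backslash survives, so the regexp sees \b (word boundary).
raw = r"\bcron\b"

print(len(plain))  # 6: backspace, c, r, o, n, backspace
print(len(raw))    # 8: each \b is still two characters

print(re.search(plain, "run cron now"))        # None
print(re.search(raw, "run cron now").group())  # cron
```
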
> ><re.Match object; span=(0, 39), match='Jul 6 14:01:23 computer.name
> >CRON[29440'>
> >
> >Produced code using group names that isolates desired output.  But this
> >will not work with re.search() I believe.
>
> It works just fine with re.search. re.match and re.search both return
> "Match" objects; they differ only in where they look: re.match matches
> only at the start of the text, while re.search scans for a match
> anywhere in it.
>
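
To see that difference on the sample log line, here is a minimal
sketch. The variable names are only for illustration.

```python
import re

line = "Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"

# re.match only tries at the very start of the string.
m_match = re.match(r"CRON", line)
# re.search scans forward until it finds a match somewhere.
m_search = re.search(r"CRON", line)

print(m_match)          # None, because the line starts with "Jul"
print(m_search.span())  # (29, 33)
```
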
> >I think I'd need to use re.sub() instead.
>
> No need.
>
> Let's look at your regexp (ignoring the quotes - they're just for
> Python):
>
>    (^[\w \:]{15}.*[^a-z\.CRON][0-9]{5})
>
> Brackets () in a regexp group a section as a single unit. You don't need
> brackets around the whole thing.
>
>    ^[\w \:]{15}.*[^a-z\.CRON][0-9]{5}
>
> Let's look at each part:
>
> ^           Start of string.
>
> [\w \:]     A single character which is a "word" character or a space or
>             a colon.
>
> {15}        Exactly 15 such characters.
>
> .*          Any number of characters (zero or more of '.', which is any
>             single character).
>
> [^a-z\.CRON] A single character which is not one of a-z, ., C, R, O, N.
>
> [0-9]       A digit. Which can also be written \d
> {5}         Exactly 5 such characters, so exactly 5 digits.
>
> I think your "CRON" above should be _outside_ the [] character range.
>
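
The breakdown above can be checked directly. This sketch runs the
original pattern, then a variant with CRON moved outside the character
class. The variant is just one possible rewrite for illustration, not
the only way to do it.

```python
import re

line = "Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"

# The original pattern: "CRON" sits inside the negated character class,
# so that class matches the single "[" before the digits.
original = re.search(r"^[\w \:]{15}.*[^a-z\.CRON][0-9]{5}", line)
print(original.group())  # Jul 6 14:01:23 computer.name CRON[29440

# "CRON" moved outside the class and matched literally; the parentheses
# now capture only the digits.
revised = re.search(r"CRON\[([0-9]{5})\]", line)
print(revised.group(1))  # 29440
```
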
> I recommend starting with a sample input line and deciding how to match
> each piece alone. You often have a choice here - take the simplest
> choice available.
>
> So:
>
>     Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)
>
> Your "[\w :]{15}" looks good. Dates in these logs are a fixed length and
> this will be reliable.
>
> After that I tend to be less trusting. So I'd match the spaces with
> \s+, meaning "1 or more whitespace characters".
>
> A computer name may have several characters, but won't have whitespace.
> You know where it will be, so just match \S+, meaning "1 or more
> nonspace characters".
>
> "CRON" seems critical to you. You can match it literally just by writing
> CRON.
>
> Alternatively, you might want any service, not just cron, so you could
> match a word ending in digits in brackets. Eg \S+\[\d+\] meaning "one or
> more nonspace characters followed by a left square bracket followed by 1
> or more digits followed by a right square bracket".
>
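
A quick sketch of the "any service" pattern. The second log line, with
sshd in it, is invented here purely to show a service other than cron.

```python
import re

service_ptn = re.compile(r"\S+\[\d+\]")

cron_line = "Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"
# Hypothetical line, made up for this example.
sshd_line = "Jul 6 14:02:01 computer.name sshd[1234]: session opened"

print(service_ptn.search(cron_line).group())  # CRON[29440]
print(service_ptn.search(sshd_line).group())  # sshd[1234]
```
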
> And so on.
>
> You do not need to match the entire line. Just stop!
>
> This lets you build up your regular expression progressively. Match the
> first thing. When that's good, add some more pattern and test again.
> Continue until you have matched everything that you need.
>
> Your plan to use named sections is good: surround the important pieces in
>
>     (?P<name>
>
> and
>
>     )
>
> Then the match object will have these named pieces available by name
> later. See the Match.groupdict method. Example:
>
>     import re
>     ptn = re.compile(r'your regexp in here')
>     m = ptn.match(your_input_line_here)
>     if not m:
>         print("NO MATCH")
>     else:
>         matches = m.groupdict()
>         # print the timestamp part of your match
>         print(matches['timestamp'])
>
> So start slowly: write a regexp, with named parts, that just matches the
> first thing. And print it by name as above. Then extend the expression
> one part at a time until everything matches.
>
> That way you only need to consider problems with the small thing you
> have added.
>
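
Putting the steps together, here is one possible end result for the
sample line. It is a sketch under assumptions, not the definitive
answer: the name LOG_PTN and the group names are invented here, and the
timestamp is matched as month, day, and time rather than with the
15-character class, because on this particular line 15 characters would
include the trailing space.

```python
import re

# Hypothetical pattern built up one piece at a time, as described above.
LOG_PTN = re.compile(
    r"(?P<timestamp>\w+\s+\d+\s+[\d:]+)"  # month, day, hh:mm:ss
    r"\s+\S+"                             # whitespace, then the host name
    r"\s+\S+\[(?P<pid>\d+)\]"             # service name, then [pid]
)

def show_time_of_pid(line):
    """Return 'timestamp pid:NNNNN' for a log line, or None if no match."""
    m = LOG_PTN.search(line)
    if not m:
        return None
    parts = m.groupdict()
    return "{} pid:{}".format(parts["timestamp"], parts["pid"])

print(show_time_of_pid(
    "Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"
))  # Jul 6 14:01:23 pid:29440
```

The pattern stops right after the pid, since nothing later in the line
is needed.
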
> Finally, note that most regexp patterns are "greedy". So .* will match
> zero or more characters, but as many as it can.
>
> You might be concerned that that would match the entire line of text.
> Well it would, _except_ that it will only match stuff as long as the
> rest of the pattern _also_ matches. So if using the whole line prevents
> the rest of the pattern matching, it backs off a character and tries
> again, making it shorter and shorter until the rest of the pattern does
> match.
>
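
The backing-off behaviour can be seen with a small sketch. The pattern
is made up for this demonstration: the greedy .* keeps as much as it
can, so it gives up only one digit to let the rest of the pattern match.

```python
import re

line = "Jul 6 14:01:23 computer.name CRON[29440]: USER (good_user)"

# .* first swallows the whole line, then backs off one character at a
# time until "one or more digits followed by ]" can match after it.
greedy = re.search(r"^(.*)(\d+)\]", line)
print(greedy.groups())
# ('Jul 6 14:01:23 computer.name CRON[2944', '0')
```

This is worth knowing when capturing the pid: anchoring on the literal
[ before the digits keeps .* from eating most of them.
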
> Cheers,
> Cameron Simpson <cs at cskk.id.au>