Parse ASCII log ; sort and keep most recent entries

Terry Reedy tjreedy at udel.edu
Thu Jun 17 11:35:44 EDT 2004


"Nova's Taylor" <novastaylor at hotmail.com> wrote in message
news:fda4b581.0406161306.c5de18f at posting.google.com...
> The log is an ASCII file that contains a process identifier (PID),
> username, date, and time field like this:
>
> 1234 williamstim 01AUG03 7:44:31
> 2348 williamstim 02AUG03 14:11:20
> 23 jonesjimbo 07AUG03 15:25:00
> 2348 williamstim 17AUG03 9:13:55
> 748 jonesjimbo 13OCT03 14:10:05
> 23 jonesjimbo 14OCT03 23:01:23
> 748 jonesjimbo 14OCT03 23:59:59

If you can get the log writer to write fixed length records with everything
lined up nicely, it would be easier to read the log by eye (with fixed
pitch font, which my newsreader doesn't use).  It is also then trivial to
slice a field out of the middle of the line.
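
A minimal sketch of that slicing, assuming a made-up fixed-width layout
(the column positions below are hypothetical, not what the real log writer
produces):

```python
# Hypothetical layout: PID right-aligned in 6 columns, username left-aligned
# in 12, then an 8-character date and an 8-character time, single spaces between.
PID = slice(0, 6)
USER = slice(7, 19)
DATE = slice(20, 28)
TIME = slice(29, 37)

# Build one record in that layout so the columns are guaranteed to line up.
line = "{:>6} {:<12} {:>8} {:>8}".format(
    "2348", "williamstim", "20030817", "09:13:55")

pid = line[PID].strip()    # padding is stripped from the variable-width fields
user = line[USER].strip()
date = line[DATE]          # fixed-width fields need no stripping
time = line[TIME]
```

With named slices, pulling one field out of the middle of a record is a
single indexing operation and needs no split() at all.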

If one wants/needs to sort records by date, life is also easier if you can
get the record writer to print dates in sortable format: YYYYMMDD.  (I
learned this 25 years ago.)
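
If the writer cannot be changed, the existing dates can be converted on
reading.  A sketch, assuming the date field is always DDMONYY with an
English three-letter month and that all years fall in the 2000s (both are
assumptions drawn from the sample, and sortable_date is a name made up
here):

```python
# Month abbreviations mapped to month numbers (assumed English, as in the sample).
MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), 1)}

def sortable_date(ddmonyy, century=2000):
    """Turn a date like '01AUG03' into 'YYYYMMDD', e.g. '20030801'."""
    day = int(ddmonyy[:2])
    month = MONTHS[ddmonyy[2:5].upper()]
    year = century + int(ddmonyy[5:7])   # the century is an assumption
    return "%04d%02d%02d" % (year, month, day)
```

As plain strings, '14OCT03' sorts before '17AUG03', which is wrong;
after conversion, '20031014' sorts after '20030817', as it should.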

> I want to read in and sort the file so the new list only contains only
> the most the most recent PID (PIDS get reused often).

If these are *nix process ids, this does not make obvious sense.  Since
pids are arbitrary, why delete a recent record because its PID got reused
while keeping an old record because its PID happened not to?  I could
better imagine keeping all records since a certain date or the last n
records (the latter is trivial with fixed-length records).
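
Both of those alternatives are short.  A sketch, assuming the date is the
third whitespace-separated field and is already in YYYYMMDD form (the
function names are made up here, and the deque is a generic way to keep
the tail of any iterable rather than the seek-from-the-end trick that
fixed-length records would allow):

```python
from collections import deque

def last_n(lines, n):
    """Keep only the final n records of an iterable of lines."""
    return list(deque(lines, maxlen=n))

def since(lines, cutoff):
    """Keep records whose date field (assumed YYYYMMDD) is on or after cutoff."""
    return [line for line in lines if line.split()[2] >= cutoff]

# Sample records, rewritten with sortable dates for this illustration.
records = [
    "1234 williamstim 20030801 07:44:31",
    "2348 williamstim 20030817 09:13:55",
    "748 jonesjimbo 20031014 23:59:59",
]
recent_two = last_n(records, 2)
from_mid_august = since(records, "20030817")
```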

> In my example, the new list would be:
>
> 1234 williamstim 01AUG03 7:44:31
> 2348 williamstim 17AUG03 9:13:55
> 23 jonesjimbo 14OCT03 23:01:23
> 748 jonesjimbo 14OCT03 23:59:59
>
> So I need to sort by PID and date + time,then keep the most recent.

That is one possibility: you have to form a list of (key, line) pairs, where
key is extracted from the line.
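
A sketch of that sort-then-filter route, assuming the four-field layout of
the sample and English month abbreviations (strptime's %b is
locale-dependent, so this relies on the default C locale; key and
most_recent_per_pid are names made up here):

```python
from datetime import datetime

def key(line):
    """Return (PID, timestamp) for one log line in the sample's layout."""
    pid, user, date, time = line.split()
    stamp = datetime.strptime(date + " " + time, "%d%b%y %H:%M:%S")
    return int(pid), stamp

def most_recent_per_pid(lines):
    """Sort (key, line) pairs, then keep the last line seen for each PID."""
    pairs = sorted((key(line), line) for line in lines)
    keep = []
    for (pid, stamp), line in pairs:
        if keep and keep[-1][0] == pid:
            keep[-1] = (pid, line)     # same PID, later timestamp: replace
        else:
            keep.append((pid, line))
    return [line for pid, line in keep]

log = [
    "1234 williamstim 01AUG03 7:44:31",
    "2348 williamstim 02AUG03 14:11:20",
    "23 jonesjimbo 07AUG03 15:25:00",
    "2348 williamstim 17AUG03 9:13:55",
    "748 jonesjimbo 13OCT03 14:10:05",
    "23 jonesjimbo 14OCT03 23:01:23",
    "748 jonesjimbo 14OCT03 23:59:59",
]
result = most_recent_per_pid(log)
```

Note that the result comes out ordered by PID, not in the original log
order.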

> Any help would be appreciated!

Alternative: instead of sort then filter duplicates, filter duplicates and
then sort the reduced list.  Assuming records are in date order from
earlier to later, insert them into a dict with PID as key and entire record
as value, and later records will replace earlier records with same key
(PID).  Then re-sort d.values() by date.  Variation: if you cannot get dates
stored properly for easy sorting, store line numbers with records so you
can sort by line number instead of fiddling with nasty dates.  Something
like (incomplete and untested):

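d = {}
with open('whatever') as f:             # open() replaces the old file() builtin
    for pair in enumerate(f):           # pair is (line number, line)
        # PID is assumed to be the first field; a getpid() function would also do
        d[pair[1].split()[0]] = pair    # later lines replace earlier ones
uniqs = sorted(d.values())              # sort the survivors by line number
new = [pair[1] for pair in uniqs]       # drop the line numbers again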
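
For a quick check, the dict approach can be run on the sample log from the
original question.  A self-contained sketch, assuming the PID is the first
whitespace-separated field and the file is already in date order
(latest_per_pid is a name made up here):

```python
SAMPLE = """\
1234 williamstim 01AUG03 7:44:31
2348 williamstim 02AUG03 14:11:20
23 jonesjimbo 07AUG03 15:25:00
2348 williamstim 17AUG03 9:13:55
748 jonesjimbo 13OCT03 14:10:05
23 jonesjimbo 14OCT03 23:01:23
748 jonesjimbo 14OCT03 23:59:59
"""

def latest_per_pid(lines):
    """Keep the last line seen for each PID, in original file order."""
    d = {}
    for pair in enumerate(lines):       # pair is (line number, line)
        fields = pair[1].split()
        if fields:                      # skip blank lines
            d[fields[0]] = pair         # later lines replace earlier ones
    return [line for number, line in sorted(d.values())]

new = latest_per_pid(SAMPLE.splitlines())
```

Because the survivors are sorted by line number, they come out in the same
order as the list the original poster asked for.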

Terry J. Reedy






