Suggested datatype for getting latest information from log files

Martin A. Brown martin at linux-ip.net
Thu Feb 11 13:58:33 EST 2016


Greetings,

>I have timestamped log files I need to read through and keep track 
>of the most upto date information.
>
>For example lets say we had a log file
>
>timeStamp,name,marblesHeld,timeNow,timeSinceLastEaten

I do not quite understand the distinction between timeStamp and 
timeNow.

>I need to keep track of every 'name' in this table, I don't want 
>duplicate values so if values come in from a later timestamp that 
>is different then that needs to get updated. For example if a later 
>timestamp showed 'dave' with less marbles that should get updated.
>
>I thought a dictionary would be a good idea because of the key 
>restrictions ensuring no duplicates, so the data would always 
>update - 

Yes.  A dictionary seems reasonable.
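
A minimal sketch of that idea, assuming timeStamp is a number that can be compared directly. The rows and their values are invented for illustration, and the names `rows` and `latest` are hypothetical:

```python
# Hypothetical rows: (timeStamp, name, marblesHeld); values are invented.
rows = [
    (100, 'dave', 12),
    (105, 'steve', 7),
    (110, 'dave', 9),
    (108, 'steve', 8),
    (103, 'dave', 15),   # arrives late: older than what we already hold
]

latest = {}
for stamp, name, marbles in rows:
    current = latest.get(name)
    # Create the entry on first sight; afterwards replace it only when
    # the incoming row is newer than the stored one.
    if current is None or stamp > current[0]:
        latest[name] = (stamp, marbles)

print(latest)
```

Comparing timestamps before overwriting means a stale row that shows up late does not clobber newer data. If the log is guaranteed to be in time order, a plain assignment would do.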

>However because they are unordered and I need to do some more 
>processing on the data afterwards I'm having trouble.

Ordered how?  For each name, you need to keep the stream of data 
ordered?  This is what I'm assuming based on your problem 
description.

If the order of names (dave, steve and jenny) is important, then you 
should look to OrderedDict as JM has suggested.
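
To make the two possible meanings of "ordered by name" concrete, here is a small sketch with invented data. It shows that an OrderedDict keeps a key in its original position when the value is updated, and that move_to_end() can be used if you would rather order by most recent update:

```python
import collections

# Invented sample data: (name, marblesHeld) in the order the log yields them.
events = [('dave', 12), ('steve', 7), ('jenny', 3), ('dave', 9)]

by_first_seen = collections.OrderedDict()
by_last_update = collections.OrderedDict()
for name, marbles in events:
    by_first_seen[name] = marbles        # updating keeps the original slot
    by_last_update[name] = marbles
    by_last_update.move_to_end(name)     # most recently updated goes last

print(list(by_first_seen))
print(list(by_last_update))
```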

I am inferring from your description that the order of events (along 
a timeline) is what is important, not the order of the players 
relative to each other (since that is already in the logfile).

>For example lets assume that once I have the most upto date values 
>from dave,steve,jenny I wanted to do timeNow - timeSinceLastEaten 
>to get an interval then write all the info together to some other 
>database. Crucially order is important here.

Again, it's not entirely clear what "order" means.  If the order of 
events for a single player is important, then see below.

>I don't know of a particular name will appear in the records or 
>not, so it needs to created on the first instance and updated from 
>then on.

Again, a dictionary is great for this.

It seems that you could also benefit from a list (to store each 
event and the time at which it occurred).  But you don't want to 
store all of history, so you want a bounded-length list.  You 
may find a collections.deque useful here.

>Could anyone suggest some good approaches or suggested data 
>structures for this?

First, JM already pointed you to OrderedDict, which may help 
depending on exactly what you are trying to order.

There are two other data structures in the collections module that 
may be helpful for you.  I perceive the following (from your 
description).

You have a set of names (players).
You wish to store, for each name, a value (marblesHeld).
You wish to store, for each name, a value (timeSinceLastEaten).

I recommend learning how to use both:

  collections.defaultdict [0]:  so you can dynamically create 
    entries for new players in the marble game without checking if 
    they already exist in the dictionary (very convenient!)

  collections.deque [1]:  in this case, I'm suggesting using it as 
    a bounded-length list; you keep adding stuff to it and after
    it stores X entries, the old ones will "fall off"
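
Here is a short sketch of each behaviour on its own, with invented values. The names `counts` and `recent` are only for illustration, and the maxlen of 3 is arbitrary:

```python
import collections

# defaultdict: a missing key is created on first access using the factory.
counts = collections.defaultdict(list)
counts['dave'].append(12)      # no need to check whether 'dave' exists
counts['dave'].append(9)
counts['jenny'].append(3)

# deque with maxlen: appending to a full deque discards the oldest item.
recent = collections.deque(maxlen=3)
for value in [1, 2, 3, 4, 5]:
    recent.append(value)

print(dict(counts))
print(list(recent))
```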

Note, I fabricated players and data, but the bit that you are 
probably interested in is the interaction between the dictionary, 
whose keys are the names of the players, and its values, each of 
which is a deque capturing the last 10 entries of the user's marble 
count and the time at which each entry was recorded.

  mydeque = functools.partial(collections.deque, maxlen=10)

  record = collections.defaultdict(mydeque)

Storing both the marble count and the time will allow you to
calculate, at any later time, the duration since the user's marble 
count last changed.
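
One way that calculation might look is sketched below. The function name `seconds_since_last_change` is hypothetical, and it assumes each deque holds (marbles, timestamp) tuples, oldest first, as in the script at the end of this message:

```python
import collections


def seconds_since_last_change(history, now):
    """Return seconds since the marble count last changed, or None.

    history holds (marbles, timestamp) tuples, oldest first.
    """
    entries = list(history)
    if not entries:
        return None
    # Walk backwards to find the earliest entry in the current run of
    # identical counts; that is when the count last changed.
    current_marbles, changed_at = entries[-1]
    for marbles, stamp in reversed(entries[:-1]):
        if marbles != current_marbles:
            break
        changed_at = stamp
    return now - changed_at


# Invented history for one player.
history = collections.deque(maxlen=10)
history.extend([(12, 100.0), (9, 130.0), (9, 160.0)])
print(seconds_since_last_change(history, now=200.0))
```

Because the deque is bounded, a player whose whole stored history shows the same count yields only a lower bound on the true duration.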

I don't understand how the eating fits into your problem, but maybe 
my code (below) will afford you an example of how to approach the 
problem with a few of Python's wonderfully convenient standard 
library data structures.

Good luck,

-Martin

P.S. I just read your reply to JM, and it looks like you also are 
trying to figure out how to read the input data.  Is it CSV?  Could 
you simply use the csv module [2]?
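
If it is CSV, a sketch along these lines might be a starting point. I am assuming the file has no header row and uses the column order from your example; the sample lines are invented and an io.StringIO stands in for the real file:

```python
import csv
import io

FIELDS = ['timeStamp', 'name', 'marblesHeld', 'timeNow', 'timeSinceLastEaten']

# Invented sample standing in for the real log file.
sample = io.StringIO(
    "100,dave,12,100,40\n"
    "105,steve,7,105,15\n"
    "110,dave,9,110,50\n"
)

latest = {}
for row in csv.DictReader(sample, fieldnames=FIELDS):
    # Later lines overwrite earlier ones for the same name; this assumes
    # the file is already in timestamp order.
    latest[row['name']] = row

for name, row in latest.items():
    print(name, row['marblesHeld'])
```

Note that the csv module hands back every field as a string, so numeric columns need converting (int() or float()) before doing arithmetic on them. With a real file, open it with newline='' as the csv documentation recommends.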

  [0] https://docs.python.org/3/library/collections.html#collections.defaultdict
  [1] https://docs.python.org/3/library/collections.html#collections.deque
  [2] https://docs.python.org/3/library/csv.html


#! /usr/bin/python3

import time
import random
import functools
import collections

import pprint

players = ['Steve', 'Jenny', 'Dave', 'Samuel', 'Jerzy', 'Ellen']
mydeque = functools.partial(collections.deque, maxlen=10)

def marblegame(rounds):
    # Each player maps to a deque holding at most the last 10 entries.
    record = collections.defaultdict(mydeque)
    for _ in range(rounds):
        now = time.time()
        who = random.choice(players)
        marbles = random.randint(0, 100)
        # Store the marble count together with the time it was seen.
        record[who].append((marbles, now))
    for whom, marblehistory in record.items():
        print(whom, end=": ")
        pprint.pprint(marblehistory)

if __name__ == '__main__':
    import sys
    if len(sys.argv) > 1:
        count = int(sys.argv[1])
    else:
        count = 30
    marblegame(count)

# -- end of file

-- 
Martin A. Brown
http://linux-ip.net/


