Data structure for plotting monotonically expanding data set

Loris Bennett loris.bennett at
Thu May 27 05:28:11 EDT 2021


I currently have around 3 years' worth of files like


so around 1000 files, each of which contains information about data
usage in lines like

  name    kb
  alice   123
  bob     4
  zebedee 9999999

(there are actually more columns).  I have about 400 users and the
individual files are around 70 KB in size.

Once a month I want to plot the historical usage as a line graph for the
whole period for which I have data for each user.

I already have some code to extract the current usage for a single user
from the most recent file:

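    import logging
    import re

    # Function name and signature are illustrative.
    def get_usage(file, user, user_column, data_column):
        regex = re.compile(user)  # compile once, not per line
        with open(file, "r") as f:
            for line in f:
                columns = line.split()
                # Skip lines too short to hold both columns.
                if len(columns) <= max(user_column, data_column):
                    logging.debug("no. of cols.: %i less than data col", len(columns))
                    continue
                if regex.match(columns[user_column]):
                    return columns[data_column]
        logging.error("unable to find %s in %s", user, file)
        return "none"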

Obviously I will want to extract all the data for all users from a file
once I have opened it.  After looping over all files I would naively end
up with, say, a nested dict like

    {"20210527": { "alice" : 123, ..., "zebedee": 9999999},
     "20210526": { "alice" : 123, "bob" : 3, ..., "zebedee": 9},
     "20210525": { "alice" : 123, "bob" : 1, ..., "zebedee": 9999999},
     "20210524": { "alice" : 123, ..., "zebedee": 9},
     "20210523": { "alice" : 123, ..., "zebedee": 9999999}}

where the user keys would vary over time as accounts, such as 'bob', are
added and later deleted.
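
For illustration, a minimal sketch of how such a nested dict might be
built. It assumes the name is in the first column and the usage in the
second, as in the sample lines above, and that the date key for each
file is supplied by the caller. The names parse_usage, collect_usage
and files_by_date are hypothetical.

```python
import io


def parse_usage(lines, user_column=0, data_column=1):
    """Return {user: usage_in_kb} for the lines of one file.

    The default column positions are assumptions based on the sample.
    """
    usage = {}
    for line in lines:
        columns = line.split()
        if len(columns) <= max(user_column, data_column):
            continue  # too few columns to hold name and usage
        try:
            usage[columns[user_column]] = int(columns[data_column])
        except ValueError:
            continue  # header line such as "name kb", or a malformed value
    return usage


def collect_usage(files_by_date):
    """Map each date string to the per-user dict parsed from its lines.

    files_by_date is a hypothetical {date: iterable of lines} mapping;
    each value could equally be an open file object.
    """
    return {date: parse_usage(lines) for date, lines in files_by_date.items()}


# Two in-memory stand-ins for real files; 'bob' exists on one day only.
sample = {
    "20210526": io.StringIO("name kb\nalice 123\nbob 3\nzebedee 9\n"),
    "20210527": io.StringIO("name kb\nalice 123\nzebedee 9999999\n"),
}
history = collect_usage(sample)
```

Users who are missing on a given day simply have no key in that day's
dict, which matches the varying user keys described above.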

Is creating a potentially rather large structure like this the best way
to go (I obviously could limit the size by, say, only considering the
last 5 years)?  Or is there some better approach for this kind of
problem?  For plotting I would probably use matplotlib.
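
Since a line graph per user needs one series per user rather than one
dict per day, the nested dict would presumably have to be inverted
before plotting. A minimal sketch, assuming the date keys are in
YYYYMMDD form as above; the name per_user_series is hypothetical, and
days on which a user has no entry are simply skipped.

```python
from collections import defaultdict
from datetime import datetime


def per_user_series(history):
    """Turn {date: {user: kb}} into {user: ([dates], [kb values])}."""
    series = defaultdict(lambda: ([], []))
    for date in sorted(history):  # YYYYMMDD strings sort chronologically
        day = datetime.strptime(date, "%Y%m%d").date()
        for user, kb in history[date].items():
            dates, values = series[user]
            dates.append(day)
            values.append(kb)
    return dict(series)


# Small made-up history; 'bob' has no entry on the last day.
history = {
    "20210525": {"alice": 123, "bob": 1},
    "20210526": {"alice": 123, "bob": 3},
    "20210527": {"alice": 123},
}
series = per_user_series(history)
```

Each (dates, values) pair could then be handed to matplotlib's plot
function to draw one line per user.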



This signature is currently under construction.
