Data structure for plotting monotonically expanding data set

Thu May 27 20:40:24 EDT 2021

On 27/05/2021 21.28, Loris Bennett wrote:
> Hi,
> 
> I currently have around 3 years' worth of files like
> 
>   home.20210527
>   home.20210526
>   home.20210525
>   ...
> 
> so around 1000 files, each of which contains information about data
> usage in lines like
> 
>   name    kb
>   alice   123
>   bob     4
>   ...
>   zebedee 9999999
> 
> (there are actually more columns).  I have about 400 users and the
> individual files are around 70 KB in size.
> 
> Once a month I want to plot the historical usage as a line graph for the
> whole period for which I have data for each user.
> 
> I already have some code to extract the current usage for a single
> user from the most recent file:
> 
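>     import logging
>     import re
> 
>     def get_usage(file, user, user_column, data_column):
>         regex = re.compile(user)  # compile once, not once per line
>         with open(file, "r") as f:
>             for line in f:
>                 columns = line.split()
>                 # both indexes must exist, so len must exceed the larger
>                 if len(columns) <= max(user_column, data_column):
>                     logging.debug("no. of cols.: %i too few", len(columns))
>                     continue
>                 if regex.match(columns[user_column]):
>                     usage = columns[data_column]
>                     logging.info(usage)
>                     return usage
>         logging.error("unable to find %s in %s", user, file)
>         return "none"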
> 
> Obviously I will want to extract all the data for all users from a file
> once I have opened it.  After looping over all files I would naively end
> up with, say, a nested dict like
> 
>     {"20210527": { "alice" : 123, ..., "zebedee": 9999999},
>      "20210526": { "alice" : 123, "bob" : 3, ..., "zebedee": 9},
>      "20210525": { "alice" : 123, "bob" : 1, ..., "zebedee": 9999999},
>      "20210524": { "alice" : 123, ..., "zebedee": 9},
>      "20210523": { "alice" : 123, ..., "zebedee": 9999999},
>      ...}
> 
> where the user keys would vary over time as accounts, such as 'bob', are
> added and later deleted.
> 
> Is creating a potentially rather large structure like this the best way
> to go (I obviously could limit the size by, say, only considering the
> last 5 years)?  Or is there some better approach for this kind of
> problem?  For plotting I would probably use matplotlib.

NB I am predisposed to use databases. People without such skills will
likely find the time and effort of learning them uneconomic for such a
simple, single example!

The expressed concern seems to be the size of the data-set, although
(one assumes) only certain users' data will be graphed at any one time.
Another concern may be that every time the routine executes, it repeats
the bulk of its regex-based processing over files it has already read.

I would establish a DB with at least two tables: one listing the files
from which the data has already been extracted, and a second containing
the data currently formatted as a dict. NB The second may benefit from
being stored in "normal form" or split into related tables, and will
certainly benefit from indexing.
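
As a rough illustration only, here is a minimal sketch of such a pair of
tables, assuming SQLite via Python's sqlite3 module. The table and
column names (files, usage, day, username, kb) are my own invention,
not anything dictated by your files, and the layout is the simplest
flat one rather than a properly normalised design:

```python
import sqlite3

# Hypothetical layout: names are illustrative assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    name TEXT PRIMARY KEY              -- e.g. 'home.20210527'
);
CREATE TABLE IF NOT EXISTS usage (
    day      TEXT    NOT NULL,         -- 'YYYYMMDD', taken from the file name
    username TEXT    NOT NULL,
    kb       INTEGER NOT NULL,
    PRIMARY KEY (username, day)        -- also serves per-user look-ups
);
CREATE INDEX IF NOT EXISTS usage_by_day ON usage (day);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

The composite primary key means one row per user per day, so loading
the same file twice by accident raises an error instead of silently
duplicating data.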

Thus the process requires two steps: firstly to capture the data (from
the files) into the DB, and secondly to graph the appropriate groups or
otherwise 'chosen' users.
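
The first (capture) step might look something like the sketch below,
again assuming SQLite and Python's sqlite3 module. Everything named
here is hypothetical: the function load_new_files, the tables files and
usage, and the assumption that the user name is in the first column and
the kB figure in the second. The point is only that a file already
recorded in the files table is never parsed again:

```python
import os
import sqlite3


def load_new_files(conn, paths):
    """Load every file not yet recorded in `files`; return how many were loaded."""
    conn.executescript(
        "CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY);"
        "CREATE TABLE IF NOT EXISTS usage (day TEXT, username TEXT, kb INTEGER,"
        " PRIMARY KEY (username, day));"
    )
    done = {row[0] for row in conn.execute("SELECT name FROM files")}
    loaded = 0
    for path in paths:
        name = os.path.basename(path)
        if name in done:
            continue  # already captured on an earlier run
        day = name.rsplit(".", 1)[-1]  # 'home.20210527' -> '20210527'
        rows = []
        with open(path) as f:
            for line in f:
                cols = line.split()
                # skip short lines and the 'name kb' header line
                if len(cols) < 2 or not cols[1].isdigit():
                    continue
                rows.append((day, cols[0], int(cols[1])))
        with conn:  # one transaction per file: all of it or none of it
            conn.executemany("INSERT INTO usage VALUES (?, ?, ?)", rows)
            conn.execute("INSERT INTO files VALUES (?)", (name,))
        done.add(name)
        loaded += 1
    return loaded
```

Run monthly over the whole directory, this would only ever read the
thirty-odd new files, however many years of history accumulate.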

SQL will simplify data retrieval and feeding the results into matplotlib
(or whichever tool). It will also make it simple either to select
sub-sets of users or to project over various periods of time.
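
To give a flavour of the second (graphing) step, here is a hedged
sketch. It assumes a table usage(day, username, kb) with day stored as
'YYYYMMDD' text; that table and the helper names usage_series and
plot_users are my own assumptions, not anything from your code.
Selecting a user or a time window is then just a WHERE clause:

```python
import sqlite3
from datetime import datetime


def usage_series(conn, username, since="00000000"):
    """Return (dates, kb_values) for one user, oldest first."""
    rows = conn.execute(
        "SELECT day, kb FROM usage"
        " WHERE username = ? AND day >= ? ORDER BY day",
        (username, since),
    ).fetchall()
    dates = [datetime.strptime(day, "%Y%m%d") for day, _ in rows]
    return dates, [kb for _, kb in rows]


def plot_users(conn, usernames, outfile="usage.png"):
    """Draw one line per user and save the figure to `outfile`."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for username in usernames:
        dates, kbs = usage_series(conn, username)
        ax.plot(dates, kbs, label=username)
    ax.set_xlabel("date")
    ax.set_ylabel("usage (kB)")
    ax.legend()
    fig.autofmt_xdate()
    fig.savefig(outfile)
```

Because 'YYYYMMDD' strings sort in date order, a plain text comparison
such as since="20210101" is enough to restrict the period.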

YMMV!
-- 
Regards,
=dn