Data structure for plotting monotonically expanding data set

Fri Jun 4 09:47:48 EDT 2021

I agree with dn. While you could scrape the text files each time you want
to display a user, from a design perspective it makes more sense to store
the data in a database. That doesn't mean you need to get rid of the text
files or change the format they are written in; you just add a small
ingest process that loads the data into the database at a regular
interval, or that you trigger manually.

You don't mention how many columns you have, but even if it is a large
number, say 100, I would recommend that you look into SQLite. It is a
file-based database that is widely used and supported, and the sqlite3
module for it is included in the Python standard library. As dn noted,
this will make retrieving the data much easier when you want to plot it.

You also don't mention how you want to display the information, but if you
want a web-based application, having your data in a database will make
integration easier there as well. With all your data in a SQLite database
you should be able to build a dashboard in Flask or in a Jupyter Notebook
fairly easily. Third-party dashboard tools like Kibana are another option,
although Kibana itself reads from Elasticsearch rather than SQLite, so the
data would have to be loaded there first. I know it may seem like more
work now, but getting your data into a database will pay off in the long
run.
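
To make the ingest idea concrete, here is a minimal sketch using only the
standard library. The table layout, the function names and the column
positions (user name in column 0, KB in column 1, one header line) are my
assumptions, not something taken from your files, so adjust them to the
real format.

```python
import sqlite3
from pathlib import Path

# Assumptions: whitespace-separated columns, user name in column 0,
# usage in KB in column 1, one header line per file.
USER_COLUMN = 0
DATA_COLUMN = 1


def open_db(db_path):
    """Open (or create) the database and make sure the usage table exists."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS usage ("
        " day TEXT NOT NULL, username TEXT NOT NULL, kb INTEGER NOT NULL,"
        " PRIMARY KEY (day, username))"
    )
    return conn


def parse_usage(lines):
    """Yield (username, kb) pairs from the lines of one usage file."""
    for line in lines:
        columns = line.split()
        if len(columns) <= max(USER_COLUMN, DATA_COLUMN):
            continue  # blank or short line
        try:
            kb = int(columns[DATA_COLUMN])
        except ValueError:
            continue  # header line or malformed value
        yield columns[USER_COLUMN], kb


def ingest_lines(conn, day, lines):
    """Store one day's lines; re-running the same day overwrites its rows."""
    rows = [(day, username, kb) for username, kb in parse_usage(lines)]
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO usage (day, username, kb) VALUES (?, ?, ?)",
            rows,
        )
    return len(rows)


def ingest_file(conn, path):
    """Ingest a file named like home.20210527, taking the day from its name."""
    path = Path(path)
    day = path.name.split(".")[-1]
    with path.open() as handle:
        return ingest_lines(conn, day, handle)
```

Run from cron or by hand, this would be something like
`conn = open_db("usage.db")` followed by `ingest_file(conn, path)` for each
`home.*` file. Because of the primary key and `INSERT OR REPLACE`, ingesting
a file twice does no harm.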

Chris

On Thu, May 27, 2021 at 8:43 PM dn via Python-list <python-list at python.org>
wrote:

> On 27/05/2021 21.28, Loris Bennett wrote:
> > Hi,
> >
> > I currently have around 3 years' worth of files like
> >
> >   home.20210527
> >   home.20210526
> >   home.20210525
> >   ...
> >
> > so around 1000 files, each of which contains information about data
> > usage in lines like
> >
> >   name    kb
> >   alice   123
> >   bob     4
> >   ...
> >   zebedee 9999999
> >
> > (there are actually more columns).  I have about 400 users and the
> > individual files are around 70 KB in size.
> >
> > Once a month I want to plot the historical usage as a line graph for the
> > whole period for which I have data for each user.
> >
> > I already have some code to extract the current usage for a single user
> > from the most recent file:
> >
> >     regex = re.compile(user)  # compile once, not once per line
> >     with open(file, "r") as f:
> >         for line in f:
> >             columns = line.split()
> >             # columns[data_column] needs at least data_column + 1 columns
> >             if len(columns) <= data_column:
> >                 logging.debug("no. of cols.: %i too few for data col", len(columns))
> >                 continue
> >             if regex.match(columns[user_column]):
> >                 usage = columns[data_column]
> >                 logging.info(usage)
> >                 return usage
> >     logging.error("unable to find %s in %s", user, file)
> >     return "none"
> >
> > Obviously I will want to extract all the data for all users from a file
> > once I have opened it.  After looping over all files I would naively end
> > up with, say, a nested dict like
> >
> >     {"20210527": { "alice" : 123, ..., "zebedee": 9999999},
> >      "20210526": { "alice" : 123, "bob" : 3, ..., "zebedee": 9},
> >      "20210525": { "alice" : 123, "bob" : 1, ..., "zebedee": 9999999},
> >      "20210524": { "alice" : 123, ..., "zebedee": 9},
> >      "20210523": { "alice" : 123, ..., "zebedee": 9999999},
> >      ...}
> >
> > where the user keys would vary over time as accounts, such as 'bob', are
> > added and later deleted.
> >
> > Is creating a potentially rather large structure like this the best way
> > to go (I obviously could limit the size by, say, only considering the
> > last 5 years)?  Or is there some better approach for this kind of
> > problem?  For plotting I would probably use matplotlib.
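
On the size question: around 1000 files times about 400 users is roughly
400,000 values, which should fit comfortably in memory as a nested dict.
For plotting you would want it per user rather than per day, though. A
minimal sketch of that pivot, with a made-up function name and `None`
marking the days on which a user has no entry:

```python
def per_user_series(usage_by_day):
    """Pivot {day: {user: kb}} into {user: [(day, kb or None), ...]}."""
    days = sorted(usage_by_day)
    users = sorted({user for day in usage_by_day.values() for user in day})
    return {
        user: [(day, usage_by_day[day].get(user)) for day in days]
        for user in users
    }


# Sample data in the shape described above; bob is missing on one day.
usage_by_day = {
    "20210527": {"alice": 123, "zebedee": 9999999},
    "20210526": {"alice": 123, "bob": 3, "zebedee": 9},
    "20210525": {"alice": 123, "bob": 1, "zebedee": 9999999},
}
series = per_user_series(usage_by_day)
# series["bob"] is [("20210525", 1), ("20210526", 3), ("20210527", None)]
```

This still re-reads every file on each run, which is the cost the database
approach avoids.
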
>
>
> NB I am predisposed to use databases. People without such skills will
> likely feel the time-and-effort investment to learn uneconomic for such
> a simple, single, example!
>
>
> Because the expressed concern seems to be the size of the data-set, (one
> assumes) only certain users' data will be graphed (at one time). Another
> concern may be that every time the routine executes, it repeats the bulk
> of its regex-based processing.
>
> I would establish a DB with (at least, as appropriate) two tables: one
> the list of files from which the data has been extracted, and the second
> containing the data currently formatted as a dict. NB The second may
> benefit from being put into "normal form" or split into related tables,
> and certainly from indexing.
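
A minimal sketch of that two-table layout in SQLite. The table, column and
function names are assumptions chosen for illustration; the point is only
that the file table lets the ingest step skip files it has already seen.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS ingested_file (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE          -- e.g. 'home.20210527'
);
CREATE TABLE IF NOT EXISTS usage (
    file_id  INTEGER NOT NULL REFERENCES ingested_file(id),
    day      TEXT    NOT NULL,         -- 'YYYYMMDD', taken from the file name
    username TEXT    NOT NULL,
    kb       INTEGER NOT NULL,
    PRIMARY KEY (username, day)        -- also serves as the per-user index
);
CREATE INDEX IF NOT EXISTS usage_by_day ON usage (day);
"""


def init_db(conn):
    """Create both tables and the index if they do not exist yet."""
    conn.executescript(SCHEMA)


def already_ingested(conn, name):
    """Return True if a file of this name has been recorded before."""
    row = conn.execute(
        "SELECT 1 FROM ingested_file WHERE name = ?", (name,)
    ).fetchone()
    return row is not None


def record_file(conn, name, rows):
    """Store one file's (username, kb) rows; return False if seen before."""
    if already_ingested(conn, name):
        return False
    day = name.split(".")[-1]  # 'home.20210527' -> '20210527'
    with conn:  # file record and data rows are committed together
        cur = conn.execute(
            "INSERT INTO ingested_file (name) VALUES (?)", (name,)
        )
        conn.executemany(
            "INSERT INTO usage (file_id, day, username, kb) VALUES (?, ?, ?, ?)",
            [(cur.lastrowid, day, username, kb) for username, kb in rows],
        )
    return True
```

With the primary key on (username, day), pulling out one user's history is
an index lookup rather than a scan of the whole table.
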
>
> Thus the process requires two steps: firstly to capture the data (from
> the files) into the DB, and secondly to graph the appropriate groups or
> otherwise 'chosen' users.
>
> SQL will simplify data retrieval, and feeding into matplotlib (or
> whichever tool). It will also enable simple improvements both to select
> sub-sets of users or to project over various periods of time.
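
As a sketch of the retrieval side, assuming a `usage` table with
`username`, `day` ('YYYYMMDD' text) and `kb` columns (my naming, not
anything from the thread): selecting a subset of users and a period becomes
one query, and the result drops straight into matplotlib.

```python
import sqlite3


def usage_series(conn, usernames, first_day=None, last_day=None):
    """Return {username: ([days], [kb])} for the chosen users and period."""
    usernames = list(usernames)
    if not usernames:
        return {}
    marks = ", ".join("?" for _ in usernames)
    sql = f"SELECT username, day, kb FROM usage WHERE username IN ({marks})"
    params = list(usernames)
    # 'YYYYMMDD' strings sort in date order, so plain comparison works.
    if first_day is not None:
        sql += " AND day >= ?"
        params.append(first_day)
    if last_day is not None:
        sql += " AND day <= ?"
        params.append(last_day)
    sql += " ORDER BY username, day"
    series = {}
    for username, day, kb in conn.execute(sql, params):
        days, values = series.setdefault(username, ([], []))
        days.append(day)
        values.append(kb)
    return series


def plot_usage(series, out_path):
    """Draw one line per user; matplotlib is imported only when plotting."""
    from datetime import datetime
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for username, (days, values) in series.items():
        dates = [datetime.strptime(d, "%Y%m%d") for d in days]
        ax.plot(dates, values, label=username)
    ax.set_xlabel("date")
    ax.set_ylabel("usage (KB)")
    ax.legend()
    fig.autofmt_xdate()
    fig.savefig(out_path)
```

For example, `plot_usage(usage_series(conn, ["alice", "bob"],
first_day="20210101"), "usage.png")` would plot two users from the start of
2021 onwards.
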
>
> YMMV!
> --
> Regards,
> =dn