Data structure for plotting monotonically expanding data set

Martin Di Paola martinp.dipaola at gmail.com
Sat Jun 5 10:24:10 EDT 2021


One way to go is to use Pandas, as was mentioned before, and Seaborn for 
plotting (which is built on top of matplotlib).

I would prototype this first with a single file, not with the ~1000 
files that you have.

Using the code that you have for parsing, add the values to a Pandas 
DataFrame (aka, a table).

# load pandas and create a 'date' object to represent the file date
# You'll have to "pip install pandas" to use it
import pandas as pd

file_date = pd.to_datetime('20210527')

# data that you parsed, as a list of lists, with each inner list
# being one line of your file.
data = [
     ["alice", 123, file_date],
     ["bob", 4, file_date],
     ["zebedee", 9999999, file_date]
     ]

# then, load it as a pd.DataFrame
df = pd.DataFrame(data, columns=['name', 'kb', 'date'])

# print it
print(df)
             name       kb       date
       0    alice      123 2021-05-27
       1      bob        4 2021-05-27
       2  zebedee  9999999 2021-05-27

Now, this is the key point: You can save the dataframe in a file
so you don't have to process the same file over and over.

Pandas supports several on-disk formats (CSV, HDF5, pickle, Parquet, ...); 
some are more suitable than others for this.

# I'm going to use the "parquet" format, which compresses really well
# and is quite fast. You'll have to "pip install pyarrow" to use it
df.to_parquet('df-20210527.pq')

Now you repeat this for all your files so you will end up with ~1000 
parquet files.
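
For example, a minimal sketch of that loop (assuming the files are named 
like home.20210527, live in the current directory, are whitespace-separated 
and start with a header line naming the columns; adjust the glob pattern 
and the parsing to your real files):

import glob
import pandas as pd

for filename in glob.glob('home.*'):
    # the date is encoded in the file name, e.g. home.20210527
    date_str = filename.split('.')[-1]
    file_date = pd.to_datetime(date_str)

    # read the whitespace-separated columns straight into a DataFrame
    df = pd.read_csv(filename, sep=r'\s+')

    # add the date column and save one parquet file per input file
    df['date'] = file_date
    df.to_parquet('df-%s.pq' % date_str)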

So, let's say that you want to plot some lines. You'll need to load 
those dataframes from disk.

You read each file, get a Pandas DataFrame for each, and then
"concatenate" them into a single Pandas DataFrame:

import glob

all_dfs = [pd.read_parquet(f) for f in glob.glob('df-*.pq')]
df = pd.concat(all_dfs, ignore_index=True)
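
(With ~400 users and ~1000 daily files that is only on the order of 
400,000 rows, which is a small DataFrame for Pandas, so holding 
everything in memory should not be a problem.)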

Now, the plotting part. You said that you wanted to use matplotlib. I'll 
go one step further and use Seaborn (which is implemented on top of 
matplotlib).

import matplotlib.pyplot as plt
import seaborn as sns

# plot the mean of 'kb' per date as a point. For each point,
# plot a vertical line showing the "spread" of the values, and connect
# the points with lines to show the change between days
sns.pointplot(data=df, x="date", y="kb")
plt.show()

# plot the distribution of the 'kb' values for each user 'name'.
sns.violinplot(data=df, x="name", y="kb")
plt.show()

# plot the 'kb' per day for the 'alice' user
sns.lineplot(data=df.query('name == "alice"'), x="date", y="kb")
plt.show()
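
If you want one line per user in a single figure, here is a sketch using 
seaborn's "hue" parameter (with ~400 users the legend would be unreadable, 
so it is disabled here; you may want to filter the DataFrame down to the 
users you care about first):

# plot the 'kb' per day with one line per user 'name'
sns.lineplot(data=df, x="date", y="kb", hue="name", legend=False)
plt.show()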

That's all, a very quick intro to Pandas and Seaborn.

Enjoy the hacking.

Thanks,
Martin.


On Thu, May 27, 2021 at 08:55:11AM -0700, Edmondo Giovannozzi wrote:
>On Thursday, May 27, 2021 at 11:28:31 UTC+2 Loris Bennett wrote:
>> Hi,
>>
>> I currently have around 3 years' worth of files like
>>
>> home.20210527
>> home.20210526
>> home.20210525
>> ...
>>
>> so around 1000 files, each of which contains information about data
>> usage in lines like
>>
>> name kb
>> alice 123
>> bob 4
>> ...
>> zebedee 9999999
>>
>> (there are actually more columns). I have about 400 users and the
>> individual files are around 70 KB in size.
>>
>> Once a month I want to plot the historical usage as a line graph for the
>> whole period for which I have data for each user.
>>
>> I already have some code to extract the current usage for a single user
>> from the most recent file:
>>
>> for line in open(file, "r"):
>>     columns = line.split()
>>     if len(columns) < data_column:
>>         logging.debug("no. of cols.: %i less than data col", len(columns))
>>         continue
>>     regex = re.compile(user)
>>     if regex.match(columns[user_column]):
>>         usage = columns[data_column]
>>         logging.info(usage)
>>         return usage
>> logging.error("unable to find %s in %s", user, file)
>> return "none"
>>
>> Obviously I will want to extract all the data for all users from a file
>> once I have opened it. After looping over all files I would naively end
>> up with, say, a nested dict like
>>
>> {"20210527": { "alice" : 123, , ..., "zebedee": 9999999},
>> "20210526": { "alice" : 123, "bob" : 3, ..., "zebedee": 9},
>> "20210525": { "alice" : 123, "bob" : 1, ..., "zebedee": 9999999},
>> "20210524": { "alice" : 123, ..., "zebedee": 9},
>> "20210523": { "alice" : 123, ..., "zebedee": 9999999},
>> ...}
>>
>> where the user keys would vary over time as accounts, such as 'bob', are
>> added and later deleted.
>>
>> Is creating a potentially rather large structure like this the best way
>> to go (I obviously could limit the size by, say, only considering the
>> last 5 years)? Or is there some better approach for this kind of
>> problem? For plotting I would probably use matplotlib.
>>
>> Cheers,
>>
>> Loris
>>
>> --
>> This signature is currently under construction.
>
>Have you tried to use pandas to read the data?
>Then you may try to add a column with the date and then join the datasets.

