Data structure for plotting monotonically expanding data set

Edmondo Giovannozzi edmondo.giovannozzi at gmail.com
Thu May 27 11:55:11 EDT 2021


On Thursday, 27 May 2021 at 11:28:31 UTC+2, Loris Bennett wrote:
> Hi, 
> 
> I currently have around 3 years' worth of files like 
> 
> home.20210527 
> home.20210526 
> home.20210525 
> ... 
> 
> so around 1000 files, each of which contains information about data 
> usage in lines like 
> 
> name kb 
> alice 123 
> bob 4 
> ... 
> zebedee 9999999 
> 
> (there are actually more columns). I have about 400 users and the 
> individual files are around 70 KB in size. 
> 
> Once a month I want to plot the historical usage as a line graph for the 
> whole period for which I have data for each user. 
> 
> I already have some code to extract the current usage for a single user 
> from the most recent file: 
> 
> for line in open(file, "r"): 
>     columns = line.split() 
>     if len(columns) < data_column: 
>         logging.debug("no. of cols.: %i less than data col", len(columns)) 
>         continue 
>     regex = re.compile(user) 
>     if regex.match(columns[user_column]): 
>         usage = columns[data_column] 
>         logging.info(usage) 
>         return usage 
> logging.error("unable to find %s in %s", user, file) 
> return "none" 
> 
> Obviously I will want to extract all the data for all users from a file 
> once I have opened it. After looping over all files I would naively end 
> up with, say, a nested dict like 
> 
> {"20210527": { "alice" : 123, ..., "zebedee": 9999999}, 
> "20210526": { "alice" : 123, "bob" : 3, ..., "zebedee": 9}, 
> "20210525": { "alice" : 123, "bob" : 1, ..., "zebedee": 9999999}, 
> "20210524": { "alice" : 123, ..., "zebedee": 9}, 
> "20210523": { "alice" : 123, ..., "zebedee": 9999999}, 
> ...} 
> 
> where the user keys would vary over time as accounts, such as 'bob', are 
> added and later deleted. 
> 
> Is creating a potentially rather large structure like this the best way 
> to go (I obviously could limit the size by, say, only considering the 
> last 5 years)? Or is there some better approach for this kind of 
> problem? For plotting I would probably use matplotlib. 
> 
> Cheers, 
> 
> Loris 
> 
> -- 
> This signature is currently under construction.

Have you tried using pandas to read the data?
You could then add a column with the date to each dataset and join the datasets.
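
A minimal sketch of that idea, assuming each file is whitespace-separated with a header line containing "name" and "kb" as in the example above. The helper name read_usage and the in-memory sample data are made up for illustration; with the real files you would pass the file path instead and take the date from the file name suffix.

```python
import io

import pandas as pd


def read_usage(source, date):
    """Read one whitespace-separated usage file and tag every row with its date."""
    frame = pd.read_csv(source, sep=r"\s+")
    # The same date is assigned to every row of this file.
    frame["date"] = pd.to_datetime(date, format="%Y%m%d")
    return frame


# Hypothetical stand-ins for the contents of home.20210525 and home.20210526.
samples = {
    "20210525": "name kb\nalice 123\nbob 1\nzebedee 9999999\n",
    "20210526": "name kb\nalice 123\nzebedee 9\n",
}

# Stack all daily tables into one long table: one row per (date, user).
frames = [read_usage(io.StringIO(text), date) for date, text in samples.items()]
long_table = pd.concat(frames, ignore_index=True)

# Reshape to one row per date and one column per user.
# Users missing on a given date (here 'bob' on the 26th) become NaN.
usage = long_table.pivot(index="date", columns="name", values="kb").sort_index()
print(usage)
```

With roughly 1000 files of about 400 users each, the long table would have on the order of 400,000 rows, which should fit comfortably in memory. Calling usage.plot() should then draw one line per user via matplotlib, with gaps where an account did not exist.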

