generating list of files matching condition

Thu Nov 24 04:18:21 EST 2016

Seb wrote:

> Hello,
> 
> Given a list of files:
> 
> In [81]: ec_files[0:10]
> Out[81]:
> 
> [u'EC_20160604002000.csv',
>  u'EC_20160604010000.csv',
>  u'EC_20160604012000.csv',
>  u'EC_20160604014000.csv',
>  u'EC_20160604020000.csv']
> 
> where the numbers are a timestamp with format %Y%m%d%H%M%S, I'd like
> to generate a list of matching files for each 2-hr period in a 2-h
> frequency time series.  Ultimately I'm using Pandas to read and handle
> the data in each group of files.  For the task of generating the files
> for each 2-hr period, I've done the following:
> 
> beg_tstamp = pd.to_datetime(ec_files[0][-18:-4],
>                             format="%Y%m%d%H%M%S")
> end_tstamp = pd.to_datetime(ec_files[-1][-18:-4],
>                             format="%Y%m%d%H%M%S")
> tstamp_win = pd.date_range(beg_tstamp, end_tstamp, freq="2H")
> 
> So tstamp_win is the 2-hr frequency time series spanning the timestamps
> in the files in ec_files.
> 
> I've generated the list of matching files for each tstamp_win using a
> comprehension:
> 
> win_files = []
> for i, w in enumerate(tstamp_win):
>     nextw = w + pd.Timedelta(2, "h")
>     ifiles = [x for x in ec_files if
>               pd.to_datetime(x[-18:-4], format="%Y%m%d%H%M%S") >= w and
>               pd.to_datetime(x[-18:-4], format="%Y%m%d%H%M%S") < nextw]
>     win_files.append(ifiles)
> 
> However, this is proving very slow, and I was wondering whether there's a
> better/faster way to do this.  Any tips would be appreciated.

Is ec_files huge? Then it might help to avoid going over the entire list 
for every interval. Instead you can sort the list once and, in a single 
pass, keep adding files to the current group while they are below nextw.
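
A minimal sketch of that single-pass idea with an explicit loop, stdlib 
only. The function name group_by_window and the sample list are made up for 
illustration; windows are assumed to start at the earliest timestamp, and 
empty windows are kept so there is one entry per 2-hr period, as in the 
original loop over tstamp_win:

```python
import datetime


def filename_to_time(filename):
    # The timestamp is the 14 characters before the ".csv" suffix.
    return datetime.datetime.strptime(filename[-18:-4], "%Y%m%d%H%M%S")


def group_by_window(filenames, delta_t):
    """Return one list of files per consecutive window of length delta_t.

    Hypothetical helper: windows start at the earliest timestamp and
    empty windows are kept as empty lists.
    """
    files = sorted(filenames, key=filename_to_time)
    if not files:
        return []
    upper_bound = filename_to_time(files[0]) + delta_t
    windows = [[]]
    for name in files:
        t = filename_to_time(name)
        # Open new (possibly empty) windows until t fits into the last one.
        while t >= upper_bound:
            windows.append([])
            upper_bound += delta_t
        windows[-1].append(name)
    return windows


sample = [
    'EC_20160604002000.csv',
    'EC_20160604010000.csv',
    'EC_20160604012000.csv',
    'EC_20160604014000.csv',
    'EC_20160604020000.csv',
    'EC_20160604050000.csv',
    'EC_20160604060000.csv',
    'EC_20160604070000.csv',
]
result = group_by_window(sample, datetime.timedelta(hours=2))
```

Each file is parsed and visited once after the sort, instead of once per 
interval.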

My pandas doesn't seem to have Timedelta (probably it's too old), so here's 
a generic solution using only the stdlib:

$ cat group_2hours.py
import itertools
import datetime
import pprint

def filename_to_time(filename):
    return datetime.datetime.strptime(filename[-18:-4], "%Y%m%d%H%M%S")

def make_key(delta_t):
    upper_bound = None
    def key(filename):
        nonlocal upper_bound

        if upper_bound is None:
            upper_bound = filename_to_time(filename) + delta_t
        else:
            t = filename_to_time(filename)
            while t >= upper_bound: # one step per window; slow if there are large gaps
                upper_bound += delta_t
        return upper_bound
    return key

ec_files = [
    u'EC_20160604002000.csv',
    u'EC_20160604010000.csv',
    u'EC_20160604012000.csv',
    u'EC_20160604014000.csv',
    u'EC_20160604020000.csv',
    u'EC_20160604050000.csv',
    u'EC_20160604060000.csv',
    u'EC_20160604070000.csv',
]
ec_files.sort() # ensure filenames are in ascending order

TWO_HOURS = datetime.timedelta(hours=2)

win_files = [
    list(group) for _key, group
    in itertools.groupby(ec_files, key=make_key(TWO_HOURS))
]

pprint.pprint(win_files)
$ python3 group_2hours.py 
[['EC_20160604002000.csv',
  'EC_20160604010000.csv',
  'EC_20160604012000.csv',
  'EC_20160604014000.csv',
  'EC_20160604020000.csv'],
 ['EC_20160604050000.csv', 'EC_20160604060000.csv'],
 ['EC_20160604070000.csv']]
$ 
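
If large gaps are a concern, one possible variant (a sketch, not tested on 
real data) computes the window number directly by integer division instead 
of advancing upper_bound one window at a time in the while loop. The name 
group_by_bucket is made up for illustration; windows are again assumed to 
start at the earliest timestamp:

```python
import collections
import datetime


def filename_to_time(filename):
    # The timestamp is the 14 characters before the ".csv" suffix.
    return datetime.datetime.strptime(filename[-18:-4], "%Y%m%d%H%M%S")


def group_by_bucket(filenames, delta_t):
    """Group files by window index (hypothetical helper).

    The index is (t - start) // delta_t, so a gap of any size costs a
    single division. Empty windows are omitted, as with groupby().
    """
    pairs = sorted((filename_to_time(name), name) for name in filenames)
    if not pairs:
        return []
    start = pairs[0][0]
    buckets = collections.defaultdict(list)
    for t, name in pairs:
        # timedelta // timedelta yields an int in Python 3.
        buckets[(t - start) // delta_t].append(name)
    return [buckets[index] for index in sorted(buckets)]


sample = [
    'EC_20160604002000.csv',
    'EC_20160604010000.csv',
    'EC_20160604012000.csv',
    'EC_20160604014000.csv',
    'EC_20160604020000.csv',
    'EC_20160604050000.csv',
    'EC_20160604060000.csv',
    'EC_20160604070000.csv',
]
grouped = group_by_bucket(sample, datetime.timedelta(hours=2))
```

For the sample list this should give the same three groups as the 
groupby() version. Because the pairs are sorted by timestamp, it does not 
depend on the filenames sorting correctly by name.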

PS: If the files' prefixes differ you cannot sort by name. Instead use

ec_files.sort(key=filename_to_time)

PPS: There is probably a way to do this by converting the list to a pandas 
dataframe; it might be worthwhile to ask in a specialised forum.