[Tutor] How to list/process files with identical character strings

Peter Otten __peter__ at web.de
Wed Jun 25 09:15:20 CEST 2014


Alex Kleider wrote:

> On 2014-06-24 14:01, mark murphy wrote:
>> Hi Danny, Marc, Peter and Alex,
>> 
>> Thanks for the responses!  Very much appreciated.
>> 
>> I will take these pointers and see what I can pull together.
>> 
>> Thanks again to all of you for taking the time to help!
> 
> 
> Assuming your files are ordered, so that the ones that need to be
> paired will be next to each other,
> and that you can get an ordered listing of their names,
> here's a suggestion as to the sort of thing that might work:
> 
> f2process = None
> for fname in listing:
>      if not f2process:
>          f2process = fname
>      elif to_be_paired(f2process, fname):
>          process(marry(f2process, fname))
>          already_processed = fname
>          f2process = None
>      else:
>          process(f2process)
>          already_processed = f2process
>          f2process = fname
> 
> if fname != already_processed:
>      # I'm not sure if 'fname' survives the for/in statement.
>      # If it doesn't, another approach to not losing the last file
>      # will be required.
>      # I hope those more expert will comment.
>      process(fname)
> 
> 
> def to_be_paired(f1, f2):
>      """Returns a boolean: true if the files need to be amalgamated."""
>      pass  # your code goes here.
> 
> def marry(f1, f2):
>      """Returns a file object which is a combination of the two files
>      named by f1 and f2."""
>      pass  # your code here.
> 
> def process(fname_or_object):
>      """Accepts either a file name or a file object. Does what you want
>      done."""
>      pass  # your code here.
> 
> Comments?
> I was surprised that the use of dictionaries was suggested, especially
> since we were told there were many many files.

(1) 10**6 would be "many files" as in "I don't want to touch them manually",
but no problem for the dict approach. "a directory of several thousand daily 
satellite images" should certainly be manageable.

(2a) os.listdir() returns a list, so you consume memory proportional to the
number of files anyway.

(2b) Even if you replace listdir() with a function that generates one 
filename at a time you cannot safely assume that the names are sorted 
-- you have to put them in a list to sort them.
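To illustrate (2b), here is a minimal sketch, assuming Python 3.6 or later where os.scandir() can be used as a context manager (that function is not part of the discussion above, and the helper name sorted_names is made up). The entries arrive one at a time, but sorted() still has to collect all of them into a list before it can return:

```python
import os
import tempfile


def sorted_names(directory):
    """Return the names found in `directory`, in sorted order.

    os.scandir() yields entries lazily, but sorted() must consume every
    entry and build a full list before it can hand back a result.
    """
    with os.scandir(directory) as entries:
        return sorted(entry.name for entry in entries)


if __name__ == "__main__":
    # Throwaway directory with made-up file names, created out of order.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ["20140103.dat", "20140101.dat", "20140102.dat"]:
            open(os.path.join(tmp, name), "w").close()
        print(sorted_names(tmp))
```

So the memory cost stays proportional to the number of files as soon as you need them in order.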

(3a) Dictionaries are *the* data structure in Python. You should rather be 
surprised when dict is not proposed for a problem. I might go as far as to 
say that most of the Python language is syntactic sugar for dicts ;) This 
leads to

(3b) dict-based solutions are usually both efficient and 

(3c) concise

To back 3c, here's how I would have written the code if it weren't for 
educational purposes:

import collections
import os

directory = "some/directory"
files = os.listdir(directory)
# Map each 8-character filename prefix to the full paths sharing it.
days = collections.defaultdict(list)

for filename in files:
    days[filename[:8]].append(os.path.join(directory, filename))

for fileset in days.values():
    if len(fileset) > 1:
        print("merging", fileset)

But I admit that sort/groupby is also fine:

import itertools
import os

directory = "some/directory"
files = os.listdir(directory)
files.sort()  # groupby() only groups adjacent items with equal keys

for _prefix, fileset in itertools.groupby(files, key=lambda name: name[:8]):
    fileset = list(fileset)
    if len(fileset) > 1:
        print("merging", [os.path.join(directory, name) for name in fileset])
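If you want to convince yourself that both approaches find the same groups, here is a small self-contained sketch that runs on made-up names instead of a real directory. The function names group_with_dict and group_with_groupby, the width parameter, and the sample file names are assumptions for illustration only; the 8-character prefix follows the snippets above.

```python
import collections
import itertools


def group_with_dict(names, width=8):
    """Group names by their first `width` characters using a defaultdict."""
    groups = collections.defaultdict(list)
    for name in names:
        groups[name[:width]].append(name)
    return dict(groups)


def group_with_groupby(names, width=8):
    """Group names by prefix using sort followed by itertools.groupby()."""
    ordered = sorted(names)  # groupby() only merges adjacent equal keys
    return {
        prefix: list(group)
        for prefix, group in itertools.groupby(ordered, key=lambda n: n[:width])
    }


if __name__ == "__main__":
    # Hypothetical file names; two of them share the prefix "20140102".
    names = ["20140102_b.img", "20140101_a.img", "20140102_a.img"]
    by_dict = group_with_dict(names)
    by_groupby = group_with_groupby(names)
    for prefix in sorted(by_dict):
        # The dict version keeps input order inside a group, the groupby
        # version keeps sorted order, so compare after sorting.
        print(prefix, sorted(by_dict[prefix]) == by_groupby[prefix])
```

The only visible difference is the order inside each group: the dict version preserves the order of the input, while the groupby version returns each group sorted.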



