Finding duplicate file names and modifying them based on elements of the path

Paul Rubin no.email at nospam.invalid
Thu Jul 19 15:43:03 EDT 2012


"Larry.Martell at gmail.com" <larry.martell at gmail.com> writes:
> Thanks for the reply Paul. I had not heard of itertools. It sounds
> like just what I need for this. But I am having 1 issue - how do you
> know how many items are in each group?

Simplest is:

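  from itertools import groupby

  # groupby only merges adjacent items, so xs should be sorted by this key
  for key, group in groupby(xs, lambda x: (x[-1], x[4], x[5])):
      gs = list(group)  # convert iterator to a list
      n = len(gs)       # this is the number of elements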

There is some theoretical inelegance in that it requires each group to
fit in memory, but you weren't really going to have billions of files
with the same basename.
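
As a rough, self-contained sketch of the same idea: the paths below are
made up for illustration, and grouping on the basename alone is an
assumption (the key above also looks at other path elements).  It sorts
first, since groupby only groups adjacent items, then records the size
of each group.

```python
from itertools import groupby
import posixpath

# Hypothetical sample paths; a real list would come from walking the file system.
paths = [
    "/data/a/run1/report.txt",
    "/data/b/run2/report.txt",
    "/data/a/run1/summary.txt",
]

# groupby only merges adjacent items, so sort by the same key first.
by_name = sorted(paths, key=posixpath.basename)

group_sizes = {}
for name, group in groupby(by_name, key=posixpath.basename):
    members = list(group)             # materialize the group iterator
    group_sizes[name] = len(members)  # number of paths sharing this basename

print(group_sizes)
```

Any name whose count is greater than 1 is a duplicate basename.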

If you're not used to iterators and itertools, note there are some
subtleties to using groupby to iterate over files, because an iterator
actually has state.  It bumps a pointer and maybe consumes some input
every time you advance it.  In a situation like the above, you've got
some nested iterators (the groupby iterator generating groups, and the
individual group iterators that come out of the groupby) that wrap the
same file handle, so bad confusion can result if you advance both
iterators without being careful (one can consume file input that you
thought would go to another).

This isn't as bad as it sounds once you get used to it, but it can be
a source of frustration at first.  
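
A small sketch of that shared-state effect, using a plain iterator over
made-up words in place of a file handle (an assumption to keep the
example self-contained).  Advancing the outer groupby iterator before
the first group has been read leaves that group empty.

```python
from itertools import groupby

# A one-shot iterator stands in for a file handle here.
words = iter(["apple", "avocado", "banana", "blueberry", "cherry"])
groups = groupby(words, key=lambda w: w[0])

first_key, first_group = next(groups)
# Advancing the outer iterator skips whatever was left of the first group.
second_key, second_group = next(groups)

stale = list(first_group)    # nothing left: the 'a' items were skipped
fresh = list(second_group)   # the 'b' items are still available

print(first_key, stale)
print(second_key, fresh)
```

Calling list(group) inside the loop body, before asking groupby for the
next group, avoids the problem.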

BTW, if you just want to count the elements of an iterator (while
consuming it),

     n = sum(1 for x in xs)

counts the elements of xs without having to expand it into an in-memory
list.
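
A short sketch of that counting idiom, with made-up data; count_items
is a hypothetical helper name.  It also shows that the iterator is used
up afterwards, and that the same trick works on groupby groups when
only the sizes are needed.

```python
from itertools import groupby

def count_items(iterable):
    """Count elements by consuming the iterable, without building a list."""
    return sum(1 for _ in iterable)

lines = (line for line in ["a\n", "b\n", "c\n"])
n = count_items(lines)
leftover = list(lines)   # the generator is exhausted after counting

# Same idea per group: count each group without keeping its members.
names = ["a.txt", "a.txt", "b.txt"]   # already sorted
sizes = {name: count_items(group) for name, group in groupby(names)}

print(n, leftover, sizes)
```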

Itertools really makes Python feel a lot more expressive and clean,
despite little kinks like the above.


