Finding duplicate file names and modifying them based on elements of the path

Thu Jul 19 14:52:09 EDT 2012

On Jul 18, 4:49 pm, Paul Rubin <no.em... at nospam.invalid> wrote:
> "Larry.Mart... at gmail.com" <larry.mart... at gmail.com> writes:
> > I have an interesting problem I'm trying to solve. I have a solution
> > almost working, but it's super ugly, and know there has to be a
> > better, cleaner way to do it. ...
>
> > My solution involves multiple maps and multiple iterations through the
> > data. How would you folks do this?
>
> You could post your code and ask for suggestions how to improve it.
> There are a lot of not-so-natural constraints in that problem, so it
> stands to reason that the code will be a bit messy.  The whole
> specification seems like an antipattern though.  You should just give a
> sensible encoding for the filename regardless of whether other fields
> are duplicated or not.  You also don't seem to address the case where
> basename, dir4, and dir5 are all duplicated.
>
> The approach I'd take for the spec as you wrote it is:
>
> 1. Sort the list on the (basename, dir4, dir5) triple, saving original
>    location (numeric index) of each item
> 2. Use itertools.groupby to group together duplicate basenames.
> 3. Within the groups, use groupby again to gather duplicate dir4's,
> 4. Within -those- groups, group by dir5 and assign sequence numbers in
>    groups where there's more than one file
> 5. Unsort to get the rewritten items back into the original order.
>
> Actual code is left as an exercise.

Thanks very much for the reply Paul. I did not know about itertools.
This seems like it will be perfect for me. But I'm having 1 issue, how
do I know how many of a given basename (and similarly how many
basename/dir4s) there are? I don't know that I have to modify a file
until I've passed it, so I have to do all kinds of contortions to save
the previous one, and deal with the last one after I fall out of the
loop, and it's getting very nasty.

reports_list is the list sorted on basename, dir4, dir5 (tool is dir4,
file_date is dir5):

for file, file_group in groupby(reports_list, lambda x: x[0]):
    # if file is unique in file_group do nothing, but how can I tell
if file is unique?
    for tool, tool_group in groupby(file_group, lambda x: x[1]):
        # if tool is unique for file, change file to tool_file, but
how can I tell if tool is unique for file?
        for file_date, file_date_group in groupby(tool_group, lambda
x: x[2]):

You can't do a len on the iterator that is returned from groupby, and
I've tried to do something with imap or      defaultdict, but I'm not
getting anywhere. I guess I can just make 2 passes through the data,
the first time getting counts. Or am I missing something about how
groupby works?

Thanks!
-larry