[Tutor] Find duplicates (using dictionaries)

Rich Lovely roadierich at googlemail.com
Wed Feb 17 19:08:34 CET 2010


On 17 February 2010 16:31, Karjer Jdfjdf <karper12345 at yahoo.com> wrote:

> I'm relatively new at Python and I'm trying to write a function that fills
> a dictionary according to the following rules and (example) data:
>
> Rules:
> * No duplicate values in field1
> * No duplicate values in field2 and field3 simultaneously (highest value in
> field4 has to be preserved)
>
>
> Rec.no field1, field2, field3, field4
> 1. abc, def123, ghi123, 120 <-- new, insert in dictionary
> 2. abc, def123, ghi123, 120 <-- duplicate with 1. field4 same value. Do not
> insert in dictionary
> 3. bcd, def123, jkl125, 154 <-- new, insert in dictionary
> 4. efg, def123, jkl125, 175 <-- duplicate with 3 in field 2 and 3, but
> higher value in field4. Remove 3. from dict and replace with 4.
> 5. hij, ghi345, jkl125, 175 <-- duplicate field3, but not in field2. New,
> insert in dict.
>
>
> The resulting dictionary should be:
>
> hij     {'F2': ' ghi345', 'F3': ' jkl125', 'F4': 175}
> abc     {'F2': ' def123', 'F3': ' ghi123', 'F4': 120}
> efg     {'F2': ' def123', 'F3': ' jkl125', 'F4': 175}
>
> This is what I have come up with so far, but there is something wrong with it.
> The 'bcd' should have been removed. When I run it, it says:
>
> bcd     {'F2': ' def123', 'F3': ' jkl125', 'F4': 154}
> hij     {'F2': ' ghi345', 'F3': ' jkl125', 'F4': 175}
> abc     {'F2': ' def123', 'F3': ' ghi123', 'F4': 120}
> efg     {'F2': ' def123', 'F3': ' jkl125', 'F4': 175}
>
> Below is what I brewed (simplified). It took me some time to figure out that I
> was looking at the wrong values in the wrong dictionary. I started again, but
> am ending up with a lot of dictionaries and for x in y-loops. I think there
> is a simpler way to do this.
>
> Can somebody point me in the right direction and explain to me how to do
> this? (and maybe have an alternative for the nesting. Because I may need to
> compare more fields. This is only a simplified dataset).
>
>
> ######### not working
> def createResults(field1, field2, field3, field4):
>         #check if field1 exists.
>                 if not results.has_key(field1):
>
>                         if results.has_key(field2):
>                                 #check if field2 already exists
>
>                                 if results.has_key(field3):
>                                     #check if field3 already exists
>                                     #retrieve value field4
>                                     existing_field4 = results[field2][F4]
>                                     #retrieve value existing field1 in dict
>                                     existing_field1 = results[field1]
>
>                                     #perform highest value check
>                                     if int(existing_field4) < int(field4):
>                                         #remove existing record from dict.
>                                         del results[existing_field1]
>                                         values = {}
>                                         values['F2'] = field2
>                                         values['F3'] = field3
>                                         values['F4'] = field4
>                                         results[field1] = values
>                                     else:
>                                         pass
>                                 else:
>                                     pass
>                         else:
>                             values = {}
>                             values['F2'] = field2
>                             values['F3'] = field3
>                             values['F4'] = field4
>                             results[field1] = values
>                 else:
>                     pass
>
>
>
>
>
> for line in open("file.csv"):
>         field1, field2, field3, field4 = line.split(',')
>         createResults(field1, field2, field3, int(field4))
>     #because this is quick and dirty I had to get rid of the \n in the csv
>
> for i in results.keys():
>         print i, '\t', results[i]
> ################
>
> contents file.csv
>
> abc, def123, ghi123, 120
> abc, def123, ghi123, 120
> bcd, def123, jkl125, 154
> efg, def123, jkl125, 175
> hij, ghi345, jkl125, 175
>
>
>
First observation: this strikes me as a poor use for dictionaries.  You
might be better off using a list of tuples.
If every line of data always has four fields, then there is no need to
check for the existence of the other elements, so you can cut out a lot
of the checks.
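
As a rough illustration of the list-of-tuples idea, each line could be turned into one fixed-size tuple. This is only a sketch: the name parse_line and the SAMPLE_LINES list are made up for the example (the data is the file.csv content quoted above), and stripping the whitespace around each field is my own assumption.

```python
# Sample data copied from the file.csv contents in the original message.
SAMPLE_LINES = [
    "abc, def123, ghi123, 120",
    "abc, def123, ghi123, 120",
    "bcd, def123, jkl125, 154",
    "efg, def123, jkl125, 175",
    "hij, ghi345, jkl125, 175",
]


def parse_line(line):
    """Split one line into a (field1, field2, field3, field4) tuple.

    Hypothetical helper: assumes exactly four comma-separated fields and
    strips surrounding whitespace (including the trailing newline).
    """
    field1, field2, field3, field4 = [part.strip() for part in line.split(",")]
    return (field1, field2, field3, int(field4))


# One tuple per record; every tuple has the same four positions,
# so no per-field existence checks are needed later on.
records = [parse_line(line) for line in SAMPLE_LINES]
```

Because the unpacking raises ValueError on a line with the wrong number of fields, malformed input shows up immediately instead of being silently skipped.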

Whilst your requirements are not exactly clear to me, here is how I would do
what it sounds like you need (using the same dict layout as you used
previously):

def add_record(field1, field2, field3, field4):
    # results is the module-level dict, keyed on field1
    if field1 in results:
        # duplicate in field1, do nothing
        return

    for key, item in results.iteritems():
        if field2 == item['F2'] and field3 == item['F3']:
            # duplicate in both field2 and field3
            if field4 > item['F4']:
                # new F4 is higher, remove old; break straight away, as the
                # dict must not change size while we keep iterating over it
                del results[key]
                break
            else:
                # existing F4 is at least as high, do nothing
                return
    # if we get here, there are no important duplicates
    results[field1] = {'F2': field2, 'F3': field3, 'F4': field4}
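
To check this against the sample data from the original message, a self-contained version might look like the sketch below. A few things here are my assumptions rather than part of the original code: the fields are stripped of whitespace, the lines are held in a list called sample instead of being read from file.csv, and the loop uses list(results.items()) in place of iteritems() so that it also runs on Python 3.

```python
results = {}  # field1 -> {'F2': ..., 'F3': ..., 'F4': ...}


def add_record(field1, field2, field3, field4):
    if field1 in results:
        # duplicate in field1, do nothing
        return
    # Iterate over a snapshot so deleting from results is always safe.
    for key, item in list(results.items()):
        if field2 == item['F2'] and field3 == item['F3']:
            if field4 > item['F4']:
                # new F4 is higher, remove the old record
                del results[key]
                break
            else:
                # existing F4 is at least as high, keep the old record
                return
    results[field1] = {'F2': field2, 'F3': field3, 'F4': field4}


# The five lines of file.csv from the original message.
sample = [
    "abc, def123, ghi123, 120",
    "abc, def123, ghi123, 120",
    "bcd, def123, jkl125, 154",
    "efg, def123, jkl125, 175",
    "hij, ghi345, jkl125, 175",
]

for line in sample:
    f1, f2, f3, f4 = [part.strip() for part in line.split(',')]
    add_record(f1, f2, f3, int(f4))

for key in sorted(results):
    print("%s\t%s" % (key, results[key]))
```

With this data the keys left in results are abc, efg and hij: 'bcd' is replaced by 'efg' because of the higher field4.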

I think this might not be exactly what you want, but hopefully it will point
you towards the solution.
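
If you later need to compare more fields, one possible variation (a sketch of a different technique, not what the function above does) is to keep a second dict keyed on the tuple of fields that must be unique together. The names add_record_indexed and by_pair are invented for this example; adding another field to the comparison would only mean adding it to the tuple.

```python
results = {}   # field1 -> {'F2': ..., 'F3': ..., 'F4': ...}
by_pair = {}   # (field2, field3) -> field1 of the record currently kept


def add_record_indexed(field1, field2, field3, field4):
    """Hypothetical variant of add_record using a lookup instead of a loop."""
    if field1 in results:
        # duplicate in field1, do nothing
        return
    pair = (field2, field3)
    old_key = by_pair.get(pair)
    if old_key is not None:
        if field4 <= results[old_key]['F4']:
            # existing record has an equal or higher F4, keep it
            return
        # new F4 is higher, drop the record it replaces
        del results[old_key]
    results[field1] = {'F2': field2, 'F3': field3, 'F4': field4}
    by_pair[pair] = field1


# Same sample data as in the original message, already split and stripped.
rows = [
    ("abc", "def123", "ghi123", 120),
    ("abc", "def123", "ghi123", 120),
    ("bcd", "def123", "jkl125", 154),
    ("efg", "def123", "jkl125", 175),
    ("hij", "ghi345", "jkl125", 175),
]
for row in rows:
    add_record_indexed(*row)
```

The trade-off is that two dicts have to be kept in step, but each insert becomes a couple of dictionary lookups rather than a scan over every stored record.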

-- 
Rich "Roadie Rich" Lovely

Just because you CAN do something, doesn't necessarily mean you SHOULD.
In fact, more often than not, you probably SHOULDN'T.  Especially if I
suggested it.

10 re-discover BASIC
20 ???
30 PRINT "Profit"
40 GOTO 10