Comparing sequences with range objects

Thu Apr 7 13:40:35 EDT 2022

On 2022-04-07 16:16, Antoon Pardon wrote:
> On 7/04/2022 at 16:08, Joel Goldstick wrote:
>> On Thu, Apr 7, 2022 at 7:19 AM Antoon Pardon <antoon.pardon at vub.be> wrote:
>>> I am working with a list of data from which I have to weed out duplicates.
>>> At the moment I keep for each entry a container with the other entries
>>> that are still possible duplicates.
>>>
>>> The problem is that sometimes that is all the rest. I thought to use a range
>>> object for these cases. Unfortunately I sometimes want to sort things
>>> and a range object is not comparable with a list or a tuple.
>>>
>>> So I have a list of items where each item is itself a list or range object.
>>> I of course could sort this by using list as a key function but that
>>> would defeat the purpose of using range objects for these cases.
>>>
>>> So what would be a relatively easy way to get the same result without wasting
>>> too much memory on entries that haven't had any weeding done on them?
>>>
>>> --
>>> Antoon Pardon.
>> I'm not sure I understand what you are trying to do, but if your data
>> has no order, you can use a set to remove the duplicates.
> 
> Sorry I wasn't clear. The data contains information about persons. But not
> all records need to be complete. So a person can occur multiple times in
> the list, while the records are all different because they are missing
> different bits.
> 
> So all records with the same firstname can be duplicates. But if I have
> a record in which the firstname is missing, it can at that point be
> a duplicate of all other records.
> 
This is how I'd approach it:

from collections import defaultdict

# Make a list of groups, where each group is a list of potential duplicates.
# Initially, all of the records are potential duplicates of each other.
# (list_of_records is the full list of record dicts.)
records = [list_of_records]

# Split the groups into subgroups according to the first name.
new_records = []

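for group in records:
    subgroups = defaultdict(list)

    for record in group:
        # .get() yields None whether the key is absent or explicitly None.
        subgroups[record.get('first_name')].append(record)

    # Records without a first name could belong to any of the subgroups.
    missing = subgroups.pop(None, [])

    if subgroups:
        # Add the nameless records to every subgroup exactly once.
        for subgroup in subgroups.values():
            subgroup.extend(missing)
    elif missing:
        # No record has a first name, so this group cannot be split.
        subgroups[None] = missing

    new_records.extend(subgroups.values())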

records = new_records

# Now repeat for the last name, etc.
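
To repeat the step for each field, the splitting could be wrapped in a helper. This is only a rough sketch, not tested against real data. It assumes each record is a dict in which a missing field is either absent or None. The name split_groups and the sample records are made up for illustration.

```python
from collections import defaultdict

def split_groups(groups, field):
    """Split each group into subgroups of records that agree on `field`.

    Records lacking the field are added to every subgroup, since they
    could be duplicates of any of them.
    """
    new_groups = []
    for group in groups:
        subgroups = defaultdict(list)
        for record in group:
            subgroups[record.get(field)].append(record)

        missing = subgroups.pop(None, [])

        if subgroups:
            for subgroup in subgroups.values():
                subgroup.extend(missing)
            new_groups.extend(subgroups.values())
        elif missing:
            # Nobody in this group has the field, so keep the group whole.
            new_groups.append(missing)
    return new_groups

# Hypothetical sample data, only to show the call pattern.
people = [
    {'first_name': 'Ann', 'last_name': 'Lee'},
    {'first_name': 'Ann'},
    {'last_name': 'Lee'},
    {'first_name': 'Bob', 'last_name': 'Ray'},
]

groups = [people]
for field in ('first_name', 'last_name'):
    groups = split_groups(groups, field)

# A group of one record has no remaining candidates to compare against.
candidates = [g for g in groups if len(g) > 1]
```

Note that a record with missing fields ends up in several groups at once, so the same pair can show up as candidates more than once and may need to be de-duplicated afterwards.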