Comparing sequences with range objects

duncan smith duncan at invalid.invalid
Fri Apr 8 10:28:30 EDT 2022


On 08/04/2022 08:21, Antoon Pardon wrote:
> 
> 
> Op 8/04/2022 om 08:24 schreef Peter J. Holzer:
>> On 2022-04-07 17:16:41 +0200, Antoon Pardon wrote:
>>> Op 7/04/2022 om 16:08 schreef Joel Goldstick:
>>>> On Thu, Apr 7, 2022 at 7:19 AM Antoon Pardon <antoon.pardon at vub.be> wrote:
>>>>> I am working with a list of data from which I have to weed out
>>>>> duplicates. At the moment I keep for each entry a container with the
>>>>> other entries that are still possible duplicates.
>> [...]
>>> Sorry I wasn't clear. The data contains information about persons.
>>> But not all records need to be complete. So a person can occur multiple
>>> times in the list, while the records are all different because they are
>>> missing different bits.
>>>
>>> So all records with the same firstname can be duplicates. But if I have
>>> a record in which the firstname is missing, it can at that point be
>>> a duplicate of all other records.
>> There are two problems. The first one is how you establish identity.
>> The second is how you weed out identical objects. In your first mail
>> you only asked about the second, but that's easy.
>>
>> The first is really hard. Not only may information be missing, no
>> single piece of information is unique or immutable. Two people may have
>> the same name (I know about several other "Peter Holzer"s), a single
>> person might change their name (when I was younger I went by my middle
>> name - how would you know that "Peter Holzer" and "Hansi Holzer" are the
>> same person?), they will move (= change their address), change jobs,
>> etc. Unless you have a unique immutable identifier that's enforced by
>> some authority (like a social security number[1]), I don't think there
>> is a chance to do that reliably in a program (although with enough data,
>> a heuristic may be good enough).
> 
> Yes I know all that. That is why I keep a bucket of possible duplicates
> per "identifying" field that is examined and use some heuristics at the
> end of all the comparing instead of starting to weed out the duplicates
> at the moment something differs.
> 
> The problem is that when an identifying field is judged to be unusable,
> the bucket to be associated with it should conceptually contain all other
> records (which in this case are the indexes into the population list).
> But that will eat a lot of memory. So I want some object that behaves as
> if it is an (immutable) list of all these indexes without actually
> containing them. A range object almost works; the only problem is that
> it is not comparable with a list.
> 

Is there any reason why you can't use ints? Just set the relevant bits.
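
A rough sketch of what I mean, in case it helps. Bit i of the int stands for
index i in the population list, so a bucket holding every record is just
(1 << n) - 1 and costs about n bits rather than a list of n ints. The helper
names (bucket_from_indexes, full_bucket, indexes_of) and the sample indexes
are made up for illustration; they are not from your code.

```python
def bucket_from_indexes(indexes):
    """Build an int whose bit i is set for every index i in indexes."""
    bucket = 0
    for i in indexes:
        bucket |= 1 << i
    return bucket


def full_bucket(n):
    """Bucket containing every index 0..n-1 (the 'unusable field' case)."""
    return (1 << n) - 1


def indexes_of(bucket):
    """Yield the indexes whose bits are set, lowest first."""
    i = 0
    while bucket:
        if bucket & 1:
            yield i
        bucket >>= 1
        i += 1


n = 8  # hypothetical population size
everyone = full_bucket(n)
same_firstname = bucket_from_indexes([1, 4, 6])  # hypothetical matches

# Buckets built in different ways compare equal when they hold the same indexes.
assert everyone == bucket_from_indexes(range(n))

# Intersect two buckets, then drop index 4 from the result.
candidates = everyone & same_firstname
without_4 = candidates & ~(1 << 4)
print(list(indexes_of(without_4)))  # [1, 6]
```

Since every bucket is a plain int, == works regardless of how the bucket was
built, which is the comparison a range and a list won't give you. Union and
intersection are | and &. Whether this beats your current containers depends
on how sparse the buckets usually are.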

Duncan

