Comparing sequences with range objects

Ian Hobson hobson42 at gmail.com
Sun Apr 10 00:37:51 EDT 2022



On 09/04/2022 13:14, Christian Gollwitzer wrote:
> Am 08.04.22 um 09:21 schrieb Antoon Pardon:
>>> The first is really hard. Not only may information be missing, no
>>> single piece of information is unique or immutable. Two people may have
>>> the same name (I know about several other "Peter Holzer"s), a single
>>> person might change their name (when I was younger I went by my middle
>>> name - how would you know that "Peter Holzer" and "Hansi Holzer" are the
>>> same person?), they will move (= change their address), change jobs,
>>> etc. Unless you have a unique immutable identifier that's enforced by
>>> some authority (like a social security number[1]), I don't think there
>>> is a chance to do that reliably in a program (although with enough data,
>>> a heuristic may be good enough).
>>
>> Yes I know all that. That is why I keep a bucket of possible duplicates
>> per "identifying" field that is examined and use some heuristics at the
>> end of all the comparing instead of starting to weed out the duplicates
>> at the moment something differs.
>>
>> The problem is that when an identifying field is judged to be unusable,
>> the bucket to be associated with it should conceptually contain all other
>> records (which in this case are the indexes into the population list).
>> But that will eat a lot of memory. So I want some object that behaves as
>> if it is an (immutable) list of all these indexes without actually
>> containing them. A range object almost works; the only problem is that
>> it is not comparable with a list.
> 
> 
> Then write your own comparator function?
> 
> Also, if the only case where this actually works is the index of all
> other records, then a simple boolean flag "all" vs. "these items in the
> index list" would suffice, wouldn't it?
> 
>      Christian
> 
Writing a comparator function is only possible for a given key. So my 
approach would be:

1) Write a comparator function that takes parameters X and Y, such that:
     if key data is missing from X, return 1
     if key data is missing from Y, return -1
     if X > Y, return 1
     if X < Y, return -1
     return 0  # They are equal and key data for both is present

2) Sort the data using the comparator function.

3) Run through the data with a trailing enumeration loop, merging 
matching records together.

4) If there are no records copied out with missing key data, then you
are done, so exit.

5) Choose a new key and repeat from step 1).
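
The steps above might be sketched in Python roughly as follows. This is only an illustration under several assumptions that are not in the original discussion: records are dicts, a missing key is represented by None (or an absent entry), "merging" means filling the gaps of the first record of a group from the later ones, and records without the current key are carried over to the next pass. The names make_comparator, merge_pass and dedupe are hypothetical. One deviation from step 1: two records that both lack the key are treated as equal, so that the comparator stays consistent.

```python
from functools import cmp_to_key


def make_comparator(key):
    """Build the step 1 comparator for one key; records missing it sort last."""
    def compare(x, y):
        kx, ky = x.get(key), y.get(key)
        if kx is None and ky is None:
            return 0   # assumption: both missing counts as a tie
        if kx is None:
            return 1
        if ky is None:
            return -1
        if kx > ky:
            return 1
        if kx < ky:
            return -1
        return 0       # equal, and key data for both is present
    return compare


def merge_pass(records, key):
    """Steps 2 and 3: sort by key, then merge each record into the
    trailing (previous) one when their keys match.
    Returns (merged, missing)."""
    ordered = sorted(records, key=cmp_to_key(make_comparator(key)))
    merged, missing = [], []
    for rec in ordered:
        if rec.get(key) is None:
            missing.append(rec)            # copied out: no key data
        elif merged and merged[-1][key] == rec[key]:
            # Same key as the trailing record: fill in its empty fields.
            for field, value in rec.items():
                if merged[-1].get(field) is None:
                    merged[-1][field] = value
        else:
            merged.append(dict(rec))       # start a new group
    return merged, missing


def dedupe(records, keys):
    """Steps 4 and 5: repeat with the next key until nothing is missing."""
    for key in keys:
        merged, missing = merge_pass(records, key)
        records = merged + missing
        if not missing:
            break
    return records


people = [
    {"name": "Peter Holzer", "email": "p@example.org", "phone": None},
    {"name": None, "email": "p@example.org", "phone": "123"},
    {"name": "Ann", "email": None, "phone": "456"},
    {"name": "Ann", "email": "a@example.org", "phone": None},
]
result = dedupe(people, ["email", "name"])
```

With this made-up data the four records collapse into two, first on email and then on name. Matching on name alone is of course exactly the kind of weak evidence discussed earlier in the thread, so a real version would need the heuristics mentioned there.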

Regards

Ian

-- 
Ian Hobson
Tel (+66) 626 544 695
