Comparing sequences with range objects

Christian Gollwitzer auriocus at gmx.de
Sat Apr 9 02:14:57 EDT 2022


Am 08.04.22 um 09:21 schrieb Antoon Pardon:
>> The first is really hard. Not only may information be missing, no single
>> single piece of information is unique or immutable. Two people may have
>> the same name (I know about several other "Peter Holzer"s), a single
>> person might change their name (when I was younger I went by my middle
>> name - how would you know that "Peter Holzer" and "Hansi Holzer" are the
>> same person?), they will move (= change their address), change jobs,
>> etc. Unless you have a unique immutable identifier that's enforced by
>> some authority (like a social security number[1]), I don't think there
>> is a chance to do that reliably in a program (although with enough data,
>> a heuristic may be good enough).
> 
> Yes I know all that. That is why I keep a bucket of possible duplicates
> per "identifying" field that is examined and use some heuristics at the
> end of all the comparing instead of starting to weed out the duplicates
> at the moment something differs.
> 
> The problem is, that when an identifying field is judged to be unusable,
> the bucket to be associated with it should conceptually contain all other
> records (which in this case are the indexes into the population list).
> But that will eat a lot of memory. So I want some object that behaves as
> if it is a (immutable) list of all these indexes without actually 
> containing
> them. A range object almost works, with the only problem it is not
> comparable with a list.


Then write your own comparator function?

Also, if the only case where this actually works is the index of all 
other records, then a simple boolean flag "all" vs. "these items in the 
index list" would suffice - doesn't it?

	Christian



More information about the Python-list mailing list