Merging lists has made my brain hurt.

Alex Martelli aleax at aleax.it
Mon Oct 7 11:22:28 EDT 2002


Thorsten Kampe wrote:
        ...
> I meant this: in a real world scenario it's often not practical to
> say: use this program, but beware, it relies on an input without
> duplicates or relies on all items being hashable, because you can't
> tell in advance whether the conditions will be met.

Right, but you can at least check whether they're met, and fall
back to less-efficient means if they aren't.
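
For instance, a minimal sketch (the function is set-flavored, i.e.
it ignores duplicate counts, and the names are mine) that simply
attempts the fast dict-based way and falls back when some item
proves to be non-hashable:

    def intersect(seq1, seq2):
        try:
            aux = {}                 # fast way: O(N) with a dict
            for x in seq2:
                aux[x] = 1
            return [x for x in seq1 if x in aux]
        except TypeError:            # some item wasn't hashable, so
            return [x for x in seq1 if x in seq2]   # fall back: O(N*M)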

>> If you don't care about duplicates you're much better off
>> normalizing first, and if you care about COUNT of duplicates but not
>> ordering (i.e., a bag) you're generally still better off normalizing
>> first.
> 
> What do you mean exactly by "normalizing"?

Sorry, I was sloppy here -- I mean "moving to a better way to
represent sets" (dicts, or Sets in Python 2.3).  Even for bags,
there are more usable representations than "sequences with
repetitions in random order", btw -- see below.
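
For example, the normalization of a sequence into a set-as-dict is
just (a trivial sketch):

    def toset(seq):
        result = {}
        for item in seq:
            result[item] = 1    # only the keys matter
        return result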

> Sorry, I wrote quite imprecisely. My interpretation of 'tuple' was
> contradictory to the traditional mathematical meaning 'order and
> repetition counts' but it meant "repetition counts, order doesn't".

So what you want are 'bags'.  OK by me, but "you can't tell in
advance whether the conditions will be met" (your words!) ALSO
applies -- and here it's quite a bit worse than wrt duplicates
or hashability, because you CAN'T check whether this "condition"
matters... if you're fed lists, there's no way to know whether
those lists' items' ordering is considered meaningful within
the given application program, or not.

>> What you mean by "corresponding" there is rather mysterious to me,
>> but wouldn't it mean "same item in the SAME position"?
> 
> I was asking myself: what would be a sensible generalization for the
> common set operations (intersection, union, etc.) for tuples?

For BAGS (order meaningless), the natural representation is a
mapping item -> number of occurrences, and the natural
generalizations of intersection and union use min and max (the
natural generalizations to integers of 'and' and 'or' on
booleans!), of course equating a mapping to 0 with absence
from the bag.
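
In code, with bags as dicts, this might look like (a sketch; the
function names are mine):

    def bag_and(bag1, bag2):         # intersection: min of counts
        result = {}
        for item in bag1:
            count = min(bag1[item], bag2.get(item, 0))
            if count > 0: result[item] = count
        return result

    def bag_or(bag1, bag2):          # union: max of counts
        result = bag1.copy()
        result.update(bag2)          # every item in either bag
        for item in result:
            result[item] = max(bag1.get(item, 0), bag2.get(item, 0))
        return result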


> Precisely: what is (or should be) the intersection of [2, 2] and [2,
> 2, 2]? Treat them as sets? Then the result is '[2]'.
> 
> Treat them as tuples? The most common algorithm is: for each item in
> first seq: if it is in second seq, then add it to the intersection.
> 
> But then:
> intersect([2, 2], [2, 2, 2]) = [2, 2]
> intersect([2, 2, 2], [2, 2]) = [2, 2, 2]
> 
> You "connect" the "corresponding" items (the first '2' in seq1 with
> the first in seq2, the second '2' in seq1 with the second '2' in seq2,
> the third '2' in seq1 with the third '2' in seq2 ... *ooops*)
>    [1, 2, 2, 2, 3]
>     |  |  |     |     <- "non-proportional font required"
> [0, 1, 2, 2,    3, 4]
> 
> So the intersection should contain all the items that can be
> "paired"/"connected". The last '2' of seq2 has no 'partner' so it
> shouldn't be part of the intersection.

If order doesn't matter, then it does NOT matter WHICH '2's of
seq2 are or aren't part of the intersection, of course -- which
is part of why representing bags by mappings is so much more
natural than representing them by sequences with repetitions
in random order.

If you ARE constrained to use sequences (because mappings are
forbidden, e.g. by the non-hashable nature of items in a
context where you're only allowed to use dicts for mappings)
I think you're still better off simulating a mapping than
fighting with the unnatural representation of bags as
sequences with repetitions in random order -- when you don't
care about the order, it just gets in the way!

So, if I had to support non-hashables, I would fall back
to a representation as a list of two-item lists, meaning
item/count.  I would actually wrap that up in all the
trappings of a mapping, and use dicts normally, falling
back to the inefficient linear representation only for
non-hashable items.  In many cases, I could have a dict
for most stuff, with a small "look-aside list of pairs"
just for those few items that are non-hashable.  Another
usable alternative is sometimes to make non-hashables
into hashables, e.g. via pickle (costly, to be sure, but
it can make the difference between O(N) and O(N squared)
work).  But I'm getting sidetracked here, sorry.
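
Still, a minimal sketch of that look-aside idea (the function
name is mine) may clarify it:

    def bag_add(counts, pairs, item):
        # counts: dict for hashable items; pairs: look-aside list of
        # [item, count] two-item lists for the non-hashable ones
        try:
            counts[item] = counts.get(item, 0) + 1
        except TypeError:            # non-hashable: linear scan
            for pair in pairs:
                if pair[0] == item:
                    pair[1] = pair[1] + 1
                    break
            else:
                pairs.append([item, 1])

E.g., feeding it 1, [2], 1, [2] leaves counts == {1: 2} and
pairs == [[[2], 2]].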


>> But it's NOT -- [1, 2, 3, 2] != [2, 1, 2, 3]. You're losing
>> information, specifically order information. A correct definition
>> should be:
>>
>> def intersect(seq1, seq2):
>>     return [ x for x, y in zip(seq1, seq2) if x == y ]
> 
> This is correct for "if first item of seq1 = first item of seq2" but I
> never had a use for such an "intersection". I could not see any "real
> use" for intersect([1, 2, 3], [3, 1, 2]) = [].

"What songs are in the same position in our hit list as they were
last week?" is an often-answered question on radio programs that
give "hit parades", for example.  "none of them" is a valid
answer.  If order is meaningful (and if you're using sequences
it should normally be), then throwing order-information away is
a very suspect design decision.
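
With the zip-based intersect above, for example:

    intersect([5, 3, 1, 2, 4], [5, 1, 3, 2, 4])   # -> [5, 2, 4]
    intersect([1, 2, 3], [3, 1, 2])               # -> []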


>> If what you want is to treat sequences as BAGS, then they're better
>> represented by dicts again -- order is the only information present
>> in a list and not in a dict, [...]
> 
> Well, usually not:
>>>> {1: 2, 1: 2} == {1:2}
> 1
> 
>>>> [2, 2] == [2]
> 0

Not sure what you mean here.  If you're representing bags by dicts
in this way, sequence [2, 2] is represented by dict {2: 2}, i.e.,
"item 2 is present twice".  Here you don't lose information just
because you consider the two occurrences indistinguishable.  More
often you do, e.g. there are three different ways to represent
bag {1: 2, 2: 1} as a sequence: [1,1,2], [1,2,1], [2,1,1].  Having
many ways to represent what is to be considered the "same" thing
to all intents and purposes should raise suspicion right off the bat.

It doesn't matter that you can write dict {1:2} (representing
the same bag as [1,1] in sequence-representation) with many
different dictionary-display strings -- that's just an artifact
of the parser, not inherent in the in-memory representation of
the abstract datum "bag" as a concrete Python object.  That you
can write {1:2,1:2} with the same meaning as {1:2} is almost as
irrelevant as the fact that you can add spaces, or a trailing
comma, at your pleasure -- it doesn't change the dictionary
object we're getting from the parser.

>> and for example "intersection" of bags then becomes, e.g:
>>
>> def intersectBags(bag1, bag2):
>>     result = {}
>>     for item in bag1:
>>         count = min(bag1[item], bag2.get(item, 0))
>>         if count>0: result[item] = count
>           ^^^^^^^^^^^ <- this is superfluous IMO

I prefer to avoid redundant storing of "mapping to 0", since
to all intents and purposes they're equivalent to absence
from the bag.  Once again, instinctively I prefer to find
a CANONICAL representation -- where each abstract object is
represented in ONE way as a concrete object.  Maybe it's an
artefact of having worked too much in combinatorics (where
double-counting is a perennial worry:-), or maybe it's an
application of Python's motto "there should be one way,
and ideally only one way".  Sometimes, for efficiency, I
have to make do with non-canonical implementations during
computation, and normalize them back to canonical form at
the end (e.g., computing with rationals, go back to the
canonical representation with mutually prime numerator and
denominator only after a series of computations), but I'm
always somewhat uncomfortable when such needs arise -- as
I consider them an optimization, I'm VERY loath to perform
them prematurely:-).
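
E.g., for rationals that last step is just (a tiny sketch, assuming
positive integers):

    def gcd(a, b):
        while b:
            a, b = b, a % b
        return a

    def canonical(num, den):
        g = gcd(num, den)            # divide out the common factor so
        return num // g, den // g    # equal rationals compare equal

    canonical(6, 4)    # -> (3, 2), same as canonical(3, 2)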

>>     return result
> 
> Yes, definitely better than my "append to result, remove from original
> list to prevent item being counted twice" version. And a lot faster
> (except for the abnormal "very large seq0, very small seq1" case).

Yes, you can check lengths and swap the bags if you want to
ensure O(min(len(bag1), len(bag2))) performance -- again this
seems to me to be more of an optimization than a "deep" issue,
but of course it may have practical importance.
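
I.e., just add, before the loop in intersectBags:

    if len(bag1) > len(bag2):
        bag1, bag2 = bag2, bag1    # iterate over the smaller bag

which is safe for intersection, since an item absent from the
smaller bag cannot be in the result anyway.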


> This is what I made of your code (it's a little optimized compared to
> my previous version so that the loop goes always through the smaller
> sequence.) I didn't change the 'union' and "symmetric difference" part
> because you have to loop through count0 *and* count1.

Given that you delegate some of the implementation of your 'or' to
your 'not', which goes to the trouble of converting back and
forth, I'd bet you would be somewhat better off by simplifying --
rather than:

>     if boolean_modus == 'and':    # intersection
>         count0 = tocount(seq0)
>         count1 = tocount(seq1)
> 
>         if count0 > count1:  # loop through the shorter list

that's NOT what you're testing for here -- use len(count0) etc
to test the lengths.

>             count0, count1 = count1, count0
>         for item in count0:
>             count0[item] = min(count0[item], count1.get(item, 0))
>         return toseq(count0)

have another auxiliary function:
    def binop(seq0, seq1, op):
        count0 = tocount(seq0)
        count1 = tocount(seq1)
        merged = count0.copy()
        merged.update(count1)    # visit items present in EITHER bag
        result = {}
        for item in merged:
            count = op(count0.get(item, 0), count1.get(item, 0))
            if count > 0: result[item] = count
        return toseq(result)

and implement your 'and' by returning binop(seq0, seq1, min) and
your 'or' by returning binop(seq0, seq1, max) -- note that binop
iterates over the merged keys, which is what lets 'or' (and 'xor',
below) see items occurring in only one of the two sequences.  I
think this also clarifies the parallelism between the two cases.


A few more points:

> def boolean(seq0, seq1, boolean_modus):
>     """ perform 'and', 'not', 'or', or 'xor' operation on sequences """
>     ## auxiliary functions
>     def tocount(seq):
>         # convert to dict where key is the hashable representation
>         # of the object and value is the object count
>         count = {}
>         for item in seq:
>             count[repr(item)] = count.get(repr(item), 0) + 1
>         return count
> 
>     def toseq(count):  # convert back to sequence
>         return [item for key in count
>                 for item in [eval(key)] * count[key]]

You're FAR more likely to meet objects where you can't count
on eval(repr(x)) == x, than non-hashable objects.
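
E.g., any plain class instance is hashable by default, yet its
repr is not evaluable:

    >>> class Song: pass
    ...
    >>> x = Song()
    >>> eval(repr(x))    # repr is '<__main__.Song instance at ...>'
    Traceback (most recent call last):
      ...
    SyntaxError: invalid syntax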

>     elif boolean_modus == 'not':  # set difference
>         count0 = tocount(seq0)
>         count1 = tocount(seq1)
> 
>         if count0 < count1:  # loop through the shorter list
>             for item in count0:
>                 count0[item] = count0[item] - count1.get(item, 0)
>         else:
>             for item in count1:
>                 count0[item] = count0.get(item, 0) - count1[item]

I don't understand this.  This way you can end up with negative
values for several items in count0 -- what do they MEAN?  You may
luck out in that [blah]*N for ANY N<=0 is [], but it seems quite
peculiar to me to COUNT on this (particularly w/o a comment!-).
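
I.e., the negative counts silently vanish only because:

    >>> ['blah'] * -3
    []

which works, but only by accident of that rule.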


>     elif boolean_modus == 'xor':  # symmetric difference
>         return boolean(boolean(seq0, seq1, 'or'), boolean(seq0, seq1,
>         'and'), 'not')

This one could also be more speedily implemented with the
already presented binop and another little auxiliary function:

    def maxminusmin(onecount, anothercount):
        return max(onecount, anothercount) - min(onecount, anothercount)

and your 'xor' is implementable as
    return binop(seq0, seq1, maxminusmin)
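
E.g., assuming your tocount and toseq helpers are in scope:

    binop([2, 2], [2, 2, 2], min)            # 'and' -> [2, 2]
    binop([2, 2], [2, 2, 2], max)            # 'or'  -> [2, 2, 2]
    binop([2, 2], [2, 2, 2], maxminusmin)    # 'xor' -> [2]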


Alex



