Merging lists has made my brain hurt.

Mon Oct 7 20:04:36 EDT 2002

> Thorsten Kampe wrote:
>> I meant this: in a real world scenario it's often not practical to
>> say: use this program, but beware, it's relies on an input without
>> duplicates or relies on all items to be hashable, because you can't
>> tell in advance whether the conditions will be met.
> 
> Right, but you can at least check whether they're met, and fall back
> to less-efficient means if they aren't.

Well, my demands for code are different: short, clean (modular) and
not O(n**2) which means the time to run a program should be linear or
logarithmic, not quadratic.

"Optimization" and error checking makes your code three or four times
as long (and without exception handling even more).

>> Sorry, I wrote quite imprecisely. My interpretation of 'tuple' was
>> contradictory to the traditional mathematical meaning 'order and
>> repetition counts' but it meant "repetition counts, order doesn't".
> 
> So what you want are 'bags'. OK by me, but "you can't tell in
> advance whether the conditions will be met" (your words!) ALSO
> applies -- and here it's quite a bit worse than wrt duplicates or
> hashability, because you CAN'T check whether this "condition"
> matters... if you're fed lists, there's no way to know whether those
> lists' items' ordering are considered meaningful within the given
> application program, or no.

Sorry, If I run my program I have to know what I want or I should get
a clue. But I cannot tell if the input stream meets some special
conditions (except by letting the program do the testing and decide
which algorithm to use).

>> You "connect" the "corresponding" items (the first '2' in seq1 with
>> the first in seq2, the second '2' in seq1 with the second '2' in
>> seq2, the third '2' in seq1 with the third '2' in seq2 ... *ooops*)
>>    [1, 2, 2, 2, 3]
>>     |  |  |     |     <- "non proportional font required"
>> [0, 1, 2, 2,    3, 4]
>> 
>> So the intersection should contain all the items that can be
>> "paired"/"connected". The last '2' of seq2 has no 'partner' so it
>> shouldn't be part of the intersection.
> 
> If order doesn't matter, then it does matter WHICH '2's of seq2 are
> or aren't part of the intersection, of course [...]

That doesn't make sense to me. If order doesn't matter but the count,
then it doesn't matter which two '2' of seq1 are part of the
intersection but that it's two '2's (and not one or three).

   [1, 2, 2, 2, 3]           [1, 2, 2, 2, 3]          [1, 2, 2, 2, 3]
    |  |  |     |      or     |     |  |  |     or     |  |     |  |
[0, 1, 2, 2,    3, 4]     [0, 1,    2, 2, 3, 4]    [0, 1, 2,    2, 3, 
4]

> If order is meaningful (and if you're using sequences it should
> normally be), then throwing order-information away is a very suspect
> design decision.

The aspect of "order" I was thinking of, was the "output order".
What's the intersection of [1, 2, 3] and [3, 2, 1]? I could not think
of any "natural" reason to declare: "it's [1, 2, 3] and not [3, 2, 1]"
(or vice versa).

>>> If what you want is to treat sequences as BAGS, then they're
>>> better represented by dicts again -- order is the only information
>>> present in a list and not in a dict, [...]
>> 
>> Well, usually not:
>>>>> {1: 2, 1: 2} == {1:2}
>> 1
>> 
>>>>> [2, 2] == [2]
>> 0

> Not sure what you mean here. [...]

A "naive" explanation of the difference between dictionary and list
would be: in a list repetition and order matters, in a dictionary not:
[1, 2, 3] != [1, 3, 2] != [1, 2, 2, 3]; {1:11, 2:22} == {2: 22, 1:11}
== {1:11, 2:22, 1:11} (at least that's what the interpreter says)

> It doesn't matter that you can write dict {1:2} (representing the
> same bag as [1,1] in sequence-representation) with many different
> dictionary-display strings -- that's just an artifact of the parser,
> not inherent in the in-memory representation of the abstract datum
> "bag" as a concrete Python object. That you can write {1:2,1:2} with
> the same meaning as {1:2} is almost as irrelevant as the fact that
> you can add spaces, or a trailing comma, at your pleasure -- it
> doesn't change the dictionary object we're getting from the parser.

Yes, but what you say is not "naive" anymore but "indepth". ;-)

>>> and for example "intersection" of bags then becomes, e.g:
>>>
>>> def intersectBags(bag1, bag2):
>>>     result = {}
>>>     for item in bag1:
>>>         count = min(bag1[item], bag2.get(item, 0))
>>>         if count>0: result[item] = count
>>           ^^^^^^^^^^^ <- this is superfluous IMO
> 
> I prefer to avoid redundant storing of "mapping to 0", since to all
> intents and purposes they're equivalent to absence from the bag.

Yes, but what is more "costly"? The "if-branch" in the loop or the
possible "null objects" (1:0, 2:0) in memory? I don't know.

> Maybe it's an artefact of having worked too much in combinatorics
> (where double-counting is a perennial worry:-), [...]

Glad to hear that! So you're victimized to have a quick look at my
"combinatoric" program that deals with permutations, combinations and
variations (with and without repetition).

>> This is what I made of your code (it's a little optimized compared to
>> my previous version so that the loop goes always through the smaller
>> sequence.) I didn't change the 'union' and "symmetric difference" part
>> because you have to loop through count0 *and* count1.
> [...]
>>         count0 = tocount(seq0)
>>         count1 = tocount(seq1)
>> 
>>         if count0 > count1:  # loop through the shorter list
> 
> that's NOT what you're testing for here -- use len(count0) etc
> to test the lengths.

Are you sure? I couldn't find a definition for "<" with dictionaries
and "{111: 111} < {11:11, 22:22}" but "{111: 111, 22:22} > {11:11,
22:22}" made me think: "'<' first tests the lengths and if they're the
same the keys are compared", but that was my assumption.

> have another auxiliary function:
>     def binop(seq0, seq1, op):
>         count0 = tocount(seq0)
>         count1 = tocount(seq1)
>         if len(count0) > len(count1):
>             count1, count0 = count0, count1
>         for item in count0:
>             count0[item] = op(count0[item], count1.get(item, 0))
> 
> and implement your 'and' by returning binop(seq0, seq1, min) and
> your 'or' by returning binop(seq0, seq1, max) -- I think this
> also clarifies the parallelism between the two cases.

It's not that easy. Union and symmetric difference take elements of
both sequences (e. g. union([11, 22], [22, 33])) so you have to loop
through both sequences!

So for the union, I think my "seq0 + (seq1 - seq0)" is shorter (and
probably faster).

> A few more points:
>> [...]
>>             count[repr(item)] = count.get(repr(item), 0) + 1
>>         return count
>> [...]
>>         return [item for key in count for item in [eval(key)] *
>>         count[key]]
> 
> You're FAR more likely to meet objects where you can't count
> on eval(repr(x)) == x, than non-hashable objects.

When would that be?

>>     elif boolean_modus == 'not':  # set difference
>>         count0 = tocount(seq0)
>>         count1 = tocount(seq1)
>> 
>>         if count0 < count1:  # loop through the shorter list
>>             for item in count0:
>>                 count0[item] = count0[item] - count1.get(item, 0)
>>         else:
>>             for item in count1:
>>                 count0[item] = count0.get(item, 0) - count1[item]
> 
> I don't understand this.  This way you can end up with negative
> values for several items in count0 -- what do they MEAN?  You may
> luck out in that [blah]*N for ANY N<=0 is [], but it seems quite
> peculiar to me to COUNT on this (particularly w/o a comment!-).

Yes, the implementation of "[item] * N" is quite useful for n < 1. It
means "no occurrence left for item in seq0": [2] - [2, 2, 2, 2] ->
{2: -3}

>>     elif boolean_modus == 'xor':  # symmetric difference
>>         return boolean(boolean(seq0, seq1, 'or'), boolean(seq0, seq1,
>>         'and'), 'not')
> 
> This one could also be more speedily implemented with the
> already presented binop and another little auxiliary function:
> 
>     def maxminusmin(onecount, anothercount):
>         return max(onecount, anothercount) - min(onecount, anothercount)

That's abs(onecount - anothercount), isn't it?

> and your 'xor' is implementable as
>     return binop(seq0, seq1, maxminusmin)

Same as for union: I'd have to loop through both lists, but that's
cheap compared to the triple todict -> to seq conversion I do at the 
moment.

Thorsten