[Python-ideas] Contiguous-array-based ordering for OrderedDict

Andrew Barnert abarnert at yahoo.com
Sat Dec 12 23:57:58 EST 2015


On Dec 12, 2015, at 17:03, Serhiy Storchaka <storchaka at gmail.com> wrote:
> 
>> On 13.12.15 02:06, Andrew Barnert via Python-ideas wrote:
>> On Dec 12, 2015, at 15:20, Serhiy Storchaka <storchaka at gmail.com> wrote:
>>>>> On 13.12.15 00:11, Andrew Barnert via Python-ideas wrote:
>>>>> On Dec 12, 2015, at 04:19, Serhiy Storchaka <storchaka at gmail.com> wrote:
>>>>>>> On 12.12.15 13:06, Andrew Barnert via Python-ideas wrote:
>>>>>>> On Dec 12, 2015, at 01:27, Franklin? Lee <leewangzhong+python at gmail.com> wrote:
>>>>>>> 2. You don't have to shift for each deletion. You can wait until some
>>>>>>> threshold is reached before shifting, and maybe that will spread out
>>>>>>> the cost of shift among deletions enough to make an impact.
>>>>>> 
>>>>>> I don't think this would help nearly as much as you think. Keeping up to half the array for deleted slots also makes things more complex and doubles the extra storage. But, worst of all, whatever savings you get in the shift time (minus the increased time for the extra logic) are going to be lost in the increased search time: if the array is twice as big, the Kth real element is at index 2K. So, instead of K + (N-K), you have 2K + x(N-K), and no matter how good that x speedup is, the 2K part is going to kill you.
>>>>>> 
>>>>>> Plus, an API that spreads out about the same work but does it in large batches is less usable. Imagine that you had two functions: one which always takes 1.1ms, and one which usually takes 100us but every 1000th call takes 1s (and freezes up the whole interpreter while doing so). Which one would you choose?
>>>>> 
>>>>> All this is true for an ordinary dict too. A hashtable needs extra storage, and iterating needs to skip unused entries. From time to time you need to resize the storage and fill the new storage, with O(N) complexity. Due to the unpredictability of the hashes of strings and pointers, the number of collisions and resizings is not predictable, and the running time can vary significantly from run to run.
>>>> 
>>>> That's not the same at all. The benefit of dict isn't that it's amortized, it's that it's amortized _constant_ instead of linear, which is a huge improvement, worth a bit of chunkiness. Going from linear to amortized linear, as the OP proposes, doesn't get you anything good for the cost.
>>>> 
>>>> I think you may be forgetting that dict doesn't rehash "from time to time" as this proposal does, it only does it when you grow. And it expands exponentially rather than linearly, so even in the special case where 100% of your operations are inserts, it's still only going to happen a few times.
>>> 
>>> Either you or I misunderstood the OP's proposal. An array of indices needs to be "compacted" (which costs O(n)) only after at least O(n) addition/deletion/moving operations. Therefore the amortized cost is constant even in the worst case.
>> 
>> You've got two halves of a process that both take N/2 time. If you turn one of those halves into amortized constant time, your total time is still linear. (Maybe in some cases you've made it twice as fast, but that's still linear; and at any rate, in this case he hasn't really made it twice as fast, because the first half, the linear search, now has twice as much to search.) So he's adding linear chunkiness without reducing the total time.
> 
> Where is the linear time in what you are saying? I don't see anything that doesn't have amortized constant complexity. What have I missed?

Let's go back to the original message. He started off talking about Tim Peters' post, which explains the need for a linked list because in an array, searching for the value takes linear time, and then removing it in-place takes linear time.
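
To make that concrete, here's a rough sketch of the kind of array-backed ordering being discussed (the class and names are mine, purely for illustration):

    class ArrayOrderedMap:
        """Ordered mapping that keeps insertion order in a plain list."""

        def __init__(self):
            self._data = {}       # key -> value
            self._order = []      # keys in insertion order

        def __setitem__(self, key, value):
            if key not in self._data:
                self._order.append(key)      # amortized O(1)
            self._data[key] = value

        def __delitem__(self, key):
            del self._data[key]              # O(1)
            self._order.remove(key)          # O(n) search + O(n) shift

        def __iter__(self):
            return iter(self._order)

That one .remove() call is both costs at once: the linear search for the key, and the linear shift to close the gap it leaves.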

He suggested fixing that by batching up the deletes. That solves the cost of removing, but it doesn't solve the problem with the linear search at all--in fact, it makes things worse, because you're now searching a sparse array. That's what I was commenting on here.
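
Concretely, the batched version might look something like this (again a sketch of my own, not the OP's code): deletions leave a tombstone, and the list is only compacted once half of it is dead.

    _DELETED = object()   # tombstone sentinel

    class TombstoneOrderedMap:
        def __init__(self):
            self._data = {}
            self._order = []      # keys in order, with _DELETED holes
            self._dead = 0

        def __setitem__(self, key, value):
            if key not in self._data:
                self._order.append(key)
            self._data[key] = value

        def __delitem__(self, key):
            del self._data[key]
            i = self._order.index(key)       # still an O(n) scan, now over a sparse list
            self._order[i] = _DELETED
            self._dead += 1
            if self._dead * 2 >= len(self._order):
                # compact: O(n), but only every ~n/2 deletions
                self._order = [k for k in self._order if k is not _DELETED]
                self._dead = 0

        def __iter__(self):
            return (k for k in self._order if k is not _DELETED)

The shift is gone, but the .index() scan and every iteration now walk a list that can be up to twice as long as the live data.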

He later suggested maybe using a second hash for indices, as the current design does. This of course does solve the problem of the search, but then you have to change half the hash values for each shift, which makes that part even worse. I commented on that further below.
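
With the index hash added it looks roughly like this (still my own illustrative sketch):

    _DELETED = object()

    class IndexedOrderedMap:
        def __init__(self):
            self._data = {}
            self._order = []
            self._index = {}      # key -> position in self._order
            self._dead = 0

        def __setitem__(self, key, value):
            if key not in self._data:
                self._index[key] = len(self._order)
                self._order.append(key)
            self._data[key] = value

        def __delitem__(self, key):
            del self._data[key]
            i = self._index.pop(key)         # O(1) lookup instead of a scan
            self._order[i] = _DELETED
            self._dead += 1
            if self._dead * 2 >= len(self._order):
                self._order = [k for k in self._order if k is not _DELETED]
                # every surviving key's stored position has to be rewritten
                self._index = {k: i for i, k in enumerate(self._order)}
                self._dead = 0

The search is now O(1), but every compaction (and any shift) has to rewrite the stored position of every key that moved.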

What about combining the two? He didn't mention that, but if you do that, every time you compact the array, you have to do a hash rebuild. Yes, a normal dict does this, but only logarithmically often, not every N/2 operations.
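
A quick back-of-the-envelope illustration of that difference (my numbers; the 2/3 load factor and 4x growth are rough stand-ins for dict's actual resize policy, and the churn workload is hypothetical):

    def dict_style_resizes(n_inserts, capacity=8, growth=4):
        # Full rebuilds for a table that grows geometrically when ~2/3 full.
        resizes = 0
        for used in range(1, n_inserts + 1):
            if used * 3 >= capacity * 2:
                capacity *= growth
                resizes += 1
        return resizes

    def compaction_rebuilds(n_ops, live=100_000):
        # Steady-state churn on a map of `live` keys: each op deletes one key
        # and inserts a replacement.  The delete leaves a hole; compact when
        # the holes equal the live entries (half the order array is dead).
        dead = rebuilds = 0
        for _ in range(n_ops):
            dead += 1
            if dead >= live:
                dead = 0
                rebuilds += 1
        return rebuilds

    print(dict_style_resizes(1_000_000))    # 9: resizes stop once growth stops
    print(compaction_rebuilds(1_000_000))   # 10: rebuilds recur as long as churn does

Both are amortized constant per operation, but the dict's full rebuilds stop once it stops growing, while the compacting scheme keeps doing a full O(N) rebuild every ~N/2 further operations for as long as the workload runs.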

>>> All this looks similar to Raymond's proof-of-concept for a compact dictionary (ordering is a side effect). [1]
>> 
>> But his design only preserves ordering if you never delete (or move, but he doesn't have an API for that).
> 
> This is only because preserving ordering is not a goal for a dict.

Sure. But it is the goal for OrderedDict. Which is exactly why a design that makes sense for dict doesn't necessarily make sense for OrderedDict, unless you add something else. The hash table of linked-list nodes serves as that something else; Raymond's array doesn't (and isn't intended to).
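
For reference, that something else looks roughly like this (a stripped-down sketch along the lines of the pure-Python OrderedDict, details elided):

    class _Link:
        __slots__ = ('prev', 'next', 'key')

    class LinkedOrderedMap:
        def __init__(self):
            self._data = {}
            self._links = {}              # key -> its _Link node
            root = _Link()                # sentinel closing the circular list
            root.prev = root.next = root
            root.key = None
            self._root = root

        def __setitem__(self, key, value):
            if key not in self._data:
                link, last = _Link(), self._root.prev
                link.prev, link.next, link.key = last, self._root, key
                last.next = self._root.prev = link
                self._links[key] = link
            self._data[key] = value

        def __delitem__(self, key):
            del self._data[key]
            link = self._links.pop(key)   # O(1) find via the hash
            link.prev.next = link.next    # O(1) unlink, nothing shifts
            link.next.prev = link.prev

        def __iter__(self):
            link = self._root.next
            while link is not self._root:
                yield link.key
                link = link.next

Finding the node is O(1) through the hash, and unlinking it is O(1), so deletion never searches or shifts anything.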

> Moving the last entry into the place of the deleted one is the simplest design if you don't care about ordering.
> 
> But it is not hard to preserve the ordering (at the cost of a larger amortized constant factor in time and memory usage). A deleted entry should just be marked as deleted, and the storage should be packed only when half (or 1/3, or 1/4) of the entries are deleted.

The entire benefit of Raymond's design is that it lets you store, and iterate, a dense array instead of a sparse one. If you're going to make that array even sparser than the hash table, you've lost all of the benefits. (But you're still paying the cost of an extra index or pointer, extra dereference, etc.)
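
For anyone who hasn't read the proof of concept, here's a toy model of the layout (heavily simplified, no resizing or deletion, names mine):

    class CompactMap:
        def __init__(self, n_slots=8):
            self._indices = [None] * n_slots   # sparse: hash slot -> entry index
            self._entries = []                 # dense: (hash, key, value), in insertion order

        def _slot(self, key):
            # Toy linear probing; the real design handles resizing properly.
            h = hash(key)
            for probe in range(len(self._indices)):
                slot = (h + probe) % len(self._indices)
                idx = self._indices[slot]
                if idx is None or self._entries[idx][1] == key:
                    return slot
            raise RuntimeError('table full (toy model does not resize)')

        def __setitem__(self, key, value):
            slot = self._slot(key)
            idx = self._indices[slot]
            if idx is None:
                self._indices[slot] = len(self._entries)
                self._entries.append((hash(key), key, value))
            else:
                self._entries[idx] = (hash(key), key, value)

        def __getitem__(self, key):
            idx = self._indices[self._slot(key)]
            if idx is None:
                raise KeyError(key)
            return self._entries[idx][2]

        def __iter__(self):
            return (key for _, key, _ in self._entries)

Iteration just walks the dense entries list in insertion order. Deleting by swapping in the last entry keeps that list dense but reorders it; tombstoning keeps the order but means the "dense" list is no longer dense--which is exactly the trade-off above.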


