[Python-ideas] Uniquify attribute for lists

Joshua Landau joshua.landau.ws at gmail.com
Thu Nov 22 23:33:49 CET 2012


On 22 November 2012 00:06, Andrew Barnert <abarnert at yahoo.com> wrote:

> From: Joshua Landau <joshua.landau.ws at gmail.com>
> Sent: Sat, November 17, 2012 11:38:22 AM
>
> >Surely the best choice is to have *two* caches; one for hashables and
> >another for the rest.
>
> Your implementation does a try: hash() to decide whether to check the
> set or the list, instead of just doing a try: item in set except: item
> in list. Is there a reason for this? It's more complicated, and it's
> measurably slower.


I did not realise that "[] in set()" raised an error! I'd just assumed it
returned False.

Thank you; this does make a small but significant difference.
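
For reference, the two-cache version with the membership test doing the
dispatch would then look something like this (untested sketch, names my
own):

def uniquify(iterable):
    seen_set = set()    # hashable items: O(1) membership
    seen_list = []      # unhashable items: O(M) linear scan
    for item in iterable:
        try:
            if item in seen_set:
                continue
            seen_set.add(item)
        except TypeError:  # unhashable: "item in seen_set" raises
            if item in seen_list:
                continue
            seen_list.append(item)
        yield item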


> >This might be improvable with a *third* cache if some non-hashables had
> >total ordering, but that could also introduce bugs, I think. It'd also
> >be a lot harder and likely be slower anyway.
>
> I agree that it's probably not worth adding to something in the standard
> library, or a recipe given in the documentation (in fact, I think I
> already said as much earlier in the thread), but I think you've got most
> of those facts wrong.
>
> It's not a lot harder. The same 4 lines you have to add to do a
> try-set-except-list, you just do again, so it's
> try-set-except-try-sortedlist-except-list.


Well, I'd sort of assumed that this included adding a sorted collection
to the mix, as one isn't in the standard library.
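
For concreteness, the three-tier version would presumably look something
like this (untested sketch, using blist's third-party sortedlist as the
middle tier):

def uniquify3(iterable):
    from blist import sortedlist  # third-party: pip install blist
    seen_set = set()              # hashable: O(1)
    seen_sorted = sortedlist()    # unhashable but orderable: O(log M)
    seen_list = []                # everything else: O(M)
    for item in iterable:
        try:
            if item in seen_set:
                continue
            seen_set.add(item)
        except TypeError:  # unhashable
            try:
                if item in seen_sorted:
                    continue
                seen_sorted.add(item)
            except TypeError:  # unorderable too (Python 3)
                if item in seen_list:
                    continue
                seen_list.append(item)
        yield item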


> And it won't introduce any bugs.


This took me a while to prove, so I'm proud of this:

>>> from blist import sortedlist
>>> {2} in sortedlist([{1, 2}, {1, 3}, {2}])
False

You *cannot* assume that a data set has total ordering just because it
has worked so far.
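
The reason: sets overload < to mean "proper subset", so two sets can each
fail to be less than the other:

>>> {1, 3} < {2}, {2} < {1, 3}
(False, False)

A sorted list's binary search assumes a total order, so it can walk right
past the element and report it missing.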


> And as for speed, it'll be O(N log M) instead of O(NM) for N elements
> with M unique, which is obviously better, but probably slower for tiny
> M, and another 5-10% overhead for inappropriate values.

Well, yes... bar the fact that you may be using it on data with an uneven
distribution of types, where some types are not comparable to each other:

[ {1, 2}, [3, 4], [1, 2], [7, 4], [2, 3], (5, 2), [2, 1] ... ]

where you'll get nowhere near O(N log M).
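
(Worse, on Python 3 the comparisons themselves blow up for mixed types:

>>> [3, 4] < (5, 2)
Traceback (most recent call last):
  ...
TypeError: unorderable types: list() < tuple()

so a sequence like the one above pushes everything back down to the O(M)
list tier anyway.)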

*And* then there's the fact that sorted collections have intrinsically
more per-operation overhead, and so are likely to carry large constant
factors.

> The problem is finding an appropriate sorted-collection type. There
> isn't one in the standard library. There's a link to an external
> SortedCollection reference in the bisect docs page, but that has O(N)
> insertion time, so it won't help. The most popular library I know of is
> blist.sortedlist, and that works, but it has quite a bit of overhead for
> searching tiny lists. As I understand it, the reason there isn't a
> standard sorted collection is the assumption that people dealing with
> huge sequences ought to expect to have some searching, comparing, and
> profiling of algorithms in their future, while those people dealing with
> len<10 sequences shouldn't have to think at all.
>
> At any rate, I tried a few different sorted collections. The performance
> for single-digit M was anywhere from 2.9x slower to 38x slower (8x with
> blist); the crossover was around M=100, and you get 10x faster by around
> M=100K. Deciding whether this is appropriate, and which implementation
> to use, and so on… well, that's exactly why there's no sorted list in
> the stdlib in the first place.


Thank you for the numbers. May I ask what libraries you used?
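
In case anyone wants to reproduce the comparison, something like this
crude single-run timing should do it (rough sketch, blist only; the exact
numbers will obviously vary by library and data):

import timeit

setup = """
import random
from blist import sortedlist
data = [random.randrange(%(M)d) for _ in range(20000)]
"""

# O(N*M): linear scan of a plain list
list_stmt = """
seen = []
for x in data:
    if x not in seen:
        seen.append(x)
"""

# O(N log M): binary search in a sortedlist
sorted_stmt = """
seen = sortedlist()
for x in data:
    if x not in seen:
        seen.add(x)
"""

for M in (10, 100, 10000):
    t_list = timeit.timeit(list_stmt, setup % {"M": M}, number=1)
    t_sorted = timeit.timeit(sorted_stmt, setup % {"M": M}, number=1)
    print(M, t_list, t_sorted)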