Drowning in a teacup?

Michael Selik michael.selik at gmail.com
Sat Apr 2 10:36:45 EDT 2016


On Sat, Apr 2, 2016 at 6:32 AM Peter Otten <__peter__ at web.de> wrote:

> Vito De Tullio wrote:
>
> > Michael Selik wrote:
> >
> >>> > I need to scan a list of strings. If one of the elements matches the
> >>> > beginning of a search keyword, that element needs to snap to the
> front
> >>> > of the list.
> >>>
> >>> I know this post regards the function passing, but, on you specific
> >>> problem,
> >>> can't you just ... sort the list with a custom key?
> >>
> >> If the number of matches is small relative to the size of the list, I'd
> >> expect the sort would be slower than most of the other suggestions.
> >
> > umm... I take the liberty to set up a little test
> >
> >     $ cat test_sort_prefix.py
> >     #!/usr/bin/env python3
> >
> >     from sys import argv
> >     from timeit import timeit
> >
> >
> >     def sort_in_place(l):
> >         def key(e):
> >             return not e.startswith('yes')
> >         l.sort(key=key)
> >
> >
> >     def original_solution(mylist):
> >         for i in range(len(mylist)):
> >             if(mylist[i].startswith('yes')):
> >                 mylist[:] = [mylist[i]] + mylist[:i] + mylist[i+1:]
>
> original_solution() reverses the order of the matches while sort_in_place()
> doesn't. If the order is not relevant I'd try something like
>
> def exchange(items):
>     pos = 0
>     for index, item in enumerate(items):
>         if item.startswith("yes"):
>             items[pos], items[index] = item, items[pos]
>             pos += 1
>
> >     def main():
> >         if argv[1] == 'sort_in_place':
> >             f = sort_in_place
> >         elif argv[1] == 'original_solution':
> >             f = original_solution
> >         else:
> >             raise Exception()
> >
> >         nomatches = int(argv[2])
> >         matches = int(argv[3])
> >
> >         l = ["no_" for _ in range(nomatches)] + ["yes_" for _ in
> > range(matches)]
>
> Python's timsort knows how to deal with (partially) sorted lists. I suggest
> that you add
>
> random.seed(42) # same list for all script invocations
> random.shuffle(l)
>
> here to spread the matches somewhat over the list.
>
> >         print(timeit("f(l)", number=100, globals={"f": f, "l": l}))
>
> Remember that f(l) modifies l, so all but the first invocation will see a
> list where all matches are at the beginning. In other words: in 99 out of
> 100 runs the list is already sorted.
>
> While adding the overhead of copying the list I would expect the results of
>
> timeit("f(l[:])", ...)
>
> to be more realistic.
>
> >     if __name__ == '__main__':
> >         main()
> >     $ ./test_sort_prefix.py sort_in_place 1000000 3
> >     33.260575089996564
> >     $ ./test_sort_prefix.py original_solution 1000000 3
> >     35.93009542999789
> >     $
> >
> >
> > in my tests there is no sensible difference between the two solutions...
> > and the number of matches is risible :)
>
> Indeed, as the number of matches rises your sort() approach will likely
> fare
> better.
>
> Your conclusions may be correct, but the benchmark to support them is
> flawed.
>
>
Option 1: scan and match, pop, insert
Scan and match is O(n)
pop is O(n) ... gotta shift everything over one after deleting
insert is O(n) on a list, better use deque.appendleft for O(1)

Option 2: map keyfunc, sort
Mapping the keyfunction is O(n)
(tim)Sorting is O(n log n) worst case, O(n) best case.

Option 1 using a deque will be O(n).
Option 2 will be O(n log n) worst case and O(n) best case.

On small datasets the constant factors matter, but then it's small, so who
cares about speed? (partially joking)



More information about the Python-list mailing list