[Tutor] Any ideas on designing an efficient algorithm for this problem?

ThreeBlindQuarks threesomequarks at proton.me
Sat Jun 24 21:34:09 EDT 2023


Alan has some good suggestions, and the best choice obviously depends on your data and requirements.

Before I say more, you did not specify what this is for. If it is a classroom assignment, the best solution may relate to what was taught in class. That may well mean avoiding certain data structures and modules and sticking to what you are supposed to know at this point.

But first, in any case, make sure you completely understand what the data looks like and exactly what form the answer must take; otherwise some algorithms will not give you what you want.

Do you need exactly the 100 most extreme numbers, in order, including duplicates?

Are the numbers all integers, or can they be floating point, complex, or something like Decimal? Can they include negative numbers, missing values, or even infinities? You may need to preprocess all billion numbers before beginning whatever algorithm you choose. For example, Inf may have to be replaced with some maximum value, floating-point values may need to be truncated or rounded, and any text may need to be converted into the same numeric format as the rest. GIGO.
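Roughly the kind of cleanup I mean, as an untested sketch; skipping bad entries and the particular replacement value for infinity are just example choices, not a recommendation:

    import math

    def clean(values, inf_replacement=1e308):
        # Coerce mixed input to floats: skip non-numeric and missing
        # entries, and cap infinities at a large finite value.
        cleaned = []
        for v in values:
            try:
                x = float(v)             # handles ints, floats and numeric text
            except (TypeError, ValueError):
                continue                 # not a number at all
            if math.isnan(x):
                continue                 # drop missing values
            if math.isinf(x):
                x = math.copysign(inf_replacement, x)
            cleaned.append(x)
        return cleaned

    print(clean([3, "7.5", None, float("inf"), float("nan"), -2]))
    # -> [3.0, 7.5, 1e+308, -2.0]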

If there are duplicates, do you want all copies collapsed into one, with the 100 most extreme of the remaining values returned, perhaps in any order?

Alan suggested converting the numbers into a set, and presumably this smaller collection would be easier to work with, for example by just sorting it. That may not suit all kinds of data, and it has the obvious problem of discarding duplicates if you need to keep them. Using a dictionary whose values track how many times each number occurs might be a way around that; see the sketch below.
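A small sketch of that dictionary-of-counts idea using collections.Counter; the helper name is mine and the example is tiny on purpose:

    from collections import Counter

    def top_n_with_counts(numbers, n):
        counts = Counter(numbers)        # value -> how many times it occurs
        top = []
        # Walk the distinct values from largest to smallest, re-expanding
        # the duplicate counts, until we have n items.
        for value in sorted(counts, reverse=True):
            top.extend([value] * counts[value])
            if len(top) >= n:
                break
        return top[:n]

    print(top_n_with_counts([5, 3, 5, 9, 1, 9, 9, 2], 5))   # [9, 9, 9, 5, 5]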

Breaking the data into smaller groups also has subtleties: the overall top 100 may be spread across several groups. As Alan points out, you would need to keep the top 100 from each group, merge those, and then keep the top 100 of the merged result, as in the sketch below.
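Something along these lines; the function name and chunk size are arbitrary choices for illustration:

    def top_n_chunked(numbers, n=100, chunk_size=1_000_000):
        # Keep only the n largest from each chunk, then take the n largest
        # of the survivors.
        survivors = []
        for start in range(0, len(numbers), chunk_size):
            chunk = numbers[start:start + chunk_size]
            survivors.extend(sorted(chunk, reverse=True)[:n])
        return sorted(survivors, reverse=True)[:n]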

A full sort, such as a merge sort, might be interesting, but for a billion items it is still a tad expensive (assuming you want duplicates for now). You would want a modified merge sort: keep splitting the current region roughly in half until a region is no larger than about 100 (or 50) elements, sort that region, and start returning results. Each merge on the way back up would keep only the 100 best results and ignore the rest, so by the time you get back to the top, you have your answer.
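A rough sketch of that truncated merge sort, untested at the billion-item scale:

    def top_n_mergesort(data, n=100):
        # Merge sort variant that never passes more than n items back up.
        if len(data) <= n:
            return sorted(data, reverse=True)
        mid = len(data) // 2
        left = top_n_mergesort(data[:mid], n)
        right = top_n_mergesort(data[mid:], n)
        # Merge the two descending lists, stopping after n items.
        merged, i, j = [], 0, 0
        while len(merged) < n and (i < len(left) or j < len(right)):
            if j >= len(right) or (i < len(left) and left[i] >= right[j]):
                merged.append(left[i])
                i += 1
            else:
                merged.append(right[j])
                j += 1
        return merged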

If you wanted to do this in parallel, that gets a tad harder.

And there is of course the simple-minded approach: find the largest number with a linear search, add it to your results list, and either remove it from the billion or mark its offset in a set of positions to ignore. Repeat until you have 100 (another 99 passes).
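A sketch of that idea; fine for seeing how it works, but at O(100 * a billion) comparisons it is far too slow for your data:

    def top_n_selection(numbers, n=100):
        # Repeated linear scans: find the current maximum among positions
        # not yet taken, record it, then ignore that position later.
        ignored = set()
        result = []
        for _ in range(min(n, len(numbers))):
            best_idx = None
            for idx, value in enumerate(numbers):
                if idx in ignored:
                    continue
                if best_idx is None or value > numbers[best_idx]:
                    best_idx = idx
            ignored.add(best_idx)
            result.append(numbers[best_idx])
        return result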

Many other approaches are possible, such as inserting all billion entries into a tree (perhaps a binary search tree) and then traversing it from the largest side until you have collected 100.
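A toy sketch of that tree idea. Note that an unbalanced tree and deep recursion would be real problems at a billion entries, so this only shows the shape of the approach:

    class Node:
        __slots__ = ("value", "left", "right")
        def __init__(self, value):
            self.value = value
            self.left = self.right = None

    def insert(root, value):
        if root is None:
            return Node(value)
        if value < root.value:
            root.left = insert(root.left, value)
        else:
            root.right = insert(root.right, value)   # duplicates go right
        return root

    def collect_largest(root, n, out):
        # Reverse in-order traversal: visit the right subtree first so the
        # largest values come out first; stop once we have n of them.
        if root is None or len(out) >= n:
            return
        collect_largest(root.right, n, out)
        if len(out) < n:
            out.append(root.value)
        collect_largest(root.left, n, out)

    root = None
    for x in [5, 3, 9, 1, 7, 9, 2]:
        root = insert(root, x)
    top = []
    collect_largest(root, 3, top)
    print(top)    # [9, 9, 7]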

I won't spell it out, but if this is for your own use, there are modules you can load that do exactly what you asked, getting the top N. You can search for those, but obviously using them for homework defeats the purpose.

- Q

Sent with Proton Mail secure email.

------- Original Message -------
On Saturday, June 24th, 2023 at 8:23 PM, Alan Gauld via Tutor <tutor at python.org> wrote:


> On 25/06/2023 00:07, marc nicole wrote:
> 
> > sry, I want to get 100 greatest elements (not the 100th element)
> 
> 
> Oops! Sorry I completely misread the requirement.
> 
> My first response would be to convert the list to a set(to remove
> duplicates - although you may want to include duplicates, you don't
> specify), sort the set and pick out the 100th element from the end.
> 
> I'm sure there are cleverer algorithms but I'd try the simple
> approach first and see if it's fast enough.
> 
> I'd also look at how you build the list in the first case,
> is it possible to build it in sorted order? Often creating
> a sorted list initially is faster overall than building a
> random list then sorting it.
> 
> Thinking about concurrent solutions you would need to collect
> the 100 highest members for each sublist you created. Then
> merge/sort the groups of 100 and pick out the 100th overall.
> If you have a relatively small set of sublists, say 100,
> each with 10million members then merging/sorting 100x100 may
> well be faster than sorting 1 billion. But timing it to
> check will be critical. But threading 100 process intensive
> instances may well lead to its own issues and moving to
> async or multi-tasking is likely to be slower too. A lot of
> experimentation and testing will be needed I suspect.
> 
> --
> Alan G
> Author of the Learn to Program web site
> http://www.alan-g.me.uk/
> http://www.amazon.com/author/alan_gauld
> Follow my photo-blog on Flickr at:
> http://www.flickr.com/photos/alangauldphotos
> 
> 
> 
> _______________________________________________
> Tutor maillist - Tutor at python.org
> To unsubscribe or change subscription options:
> https://mail.python.org/mailman/listinfo/tutor

