[Tutor] decomposing a problem

Steven D'Aprano steve at pearwood.info
Thu Dec 27 17:38:13 EST 2018


On Wed, Dec 26, 2018 at 11:02:07AM -0500, Avi Gross wrote:

> I often find that I try to make a main point and people then focus on
> something else, like an example.

I can't speak for others, but for me it could be for any of several 
reasons:

- I agree with what you say, but don't feel like adding "I agree!!!!" 
after each paragraph of yours;

- I disagree, but can't be bothered arguing;

- I don't understand the point you intend to make, so just move on.

But when you make an obvious error, I tend to respond. This is supposed 
to be a list for teaching people to use Python better, after all.


> So, do we agree on the main point that choosing a specific data structure or
> algorithm (or even computer language) too soon can lead to problems that can
> be avoided if we first map out the problem and understand it better?

Sure, why not? That's vague and generic enough that it has to be true.

But if it's meant as advice, you don't really offer anything concrete. 
How does one decide what is "too soon"? How does one avoid design 
paralysis?


> I do not concede that efficiency can be ignored because computers are fast.

That's good, but I'm not sure why you think it is relevant as I never 
suggested that efficiency can be ignored. Only that what people *guess* 
is "lots of data" and what actually *is* lots of data may not be the 
same thing.


> I do concede that it is often not worth the effort or that you can
> inadvertently make things worse and there are tradeoffs.

Okay.


> Let me be specific. The side topic was asking how to get a random key from
> an existing dictionary. If you do this ONCE, it may be no big deal to make a
> list of all keys, index it by a random number, and move on. I did supply a
> solution that might (or might not) run faster by using a generator to get one
> item at a time and stopping when found. Less space but not sure if less
> time.

Why don't you try it and find out?
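Since no code was posted, here is one way the comparison might be set 
up (the dictionary size and function names are my own invention, not 
anything from the thread):

```python
import random
import timeit

# A sample dictionary to stand in for "an existing dictionary";
# the size is an arbitrary choice for timing purposes.
mydict = {f"key{i}": i for i in range(100_000)}

def via_list(d):
    # Materialise all the keys into a list, then index at random.
    # Costs O(n) memory for the list, but the lookup itself is cheap.
    return random.choice(list(d.keys()))

def via_generator(d):
    # Walk an iterator over the keys and stop at a randomly chosen
    # position.  Saves the list's memory, but still touches up to
    # n keys, so it is not obviously faster.
    n = random.randrange(len(d))
    for i, key in enumerate(d):
        if i == n:
            return key

print(timeit.timeit(lambda: via_list(mydict), number=100))
print(timeit.timeit(lambda: via_generator(mydict), number=100))
```

Running something like that settles the question for your data sizes 
far better than guessing.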


> But what I often need to do is to segment lots of data into two piles. One
> is for training purposes using some machine learning algorithm and the
> remainder is to be used for verifications. The choice must be random or the
> entire project may become meaningless. So if your data structure was a
> dictionary with key names promptly abandoned, you cannot just call pop()
> umpteen times to get supposedly random results as they may come in a very
> specific order.

Fortunately I never suggested doing that.
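The quoted concern is real as far as it goes, though: on modern CPython 
(3.7 and later) dicts preserve insertion order, so repeated popitem() 
calls come out in a fixed LIFO order, not a random one. A tiny 
illustration (the toy dict is mine):

```python
d = {"a": 1, "b": 2, "c": 3}

# popitem() removes entries in last-in-first-out order, so the
# sequence below is deterministic -- no good as a source of
# "random" records.
print([d.popitem() for _ in range(3)])  # [('c', 3), ('b', 2), ('a', 1)]
```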


> If you want to have 75% of the data in the training section,
> and 25% reserved, and you have millions of records, what is a good way to
> go? 

The obvious solution:

import random

keys = list(mydict.keys())
random.shuffle(keys)
index = len(keys)*3//4
training_data = keys[:index]
reserved = keys[index:]

Now you have the keys split into training data and reserved data. To 
extract the value, you can just call mydict[some_key]. If you prefer, 
you can generate two distinct dicts:

training_data = {key: mydict[key] for key in training_data}

and similarly for the reserved data, and then mydict becomes redundant 
and you are free to delete it (or just ignore it).
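Put together as a runnable sketch (the toy records are mine, and I've 
given the key lists and the dicts separate names to keep them 
distinct):

```python
import random

# Toy data standing in for the "millions of records".
mydict = {f"record{i}": i for i in range(1000)}

# Shuffle the keys, then cut at the 75% mark.
keys = list(mydict.keys())
random.shuffle(keys)
index = len(keys) * 3 // 4

training_keys = keys[:index]
reserved_keys = keys[index:]

# Build the two dicts; mydict is now redundant.
training_data = {key: mydict[key] for key in training_keys}
reserved = {key: mydict[key] for key in reserved_keys}

print(len(training_data), len(reserved))  # 750 250
```

The two dicts are disjoint and together cover every record, which is 
exactly the property the training/verification split needs.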

Anything more complex than this solution should not even be attempted 
until you have tried the simple, obvious solution and discovered that it 
isn't satisfactory.

Keep it simple. Try the simplest thing that works first, and don't add 
complexity until you know that you need it.

By the way, your comments would be more credible if you had actual 
working code that demonstrates your point, rather than making vague 
comments that something "may" be faster. Sure, anything "may" be faster. 
We can say that about literally anything. Walking to Alaska from the 
southernmost tip of Chile while dragging a grand piano behind you "may" 
be faster than flying, but probably isn't. Unless you have actual code 
backing up your assertions, they're pretty meaningless.

And the advantage of working code is that people might actually learn 
some Python too.



-- 
Steve

