Find word by given characters

duncan smith duncan at invalid.invalid
Wed Nov 4 13:09:19 EST 2020


On 04/11/2020 04:21, Avi Gross wrote:
> Duncan, my comments below yours at end.
> 
> ---YOURS---
> The Counter approach only requires iterating over the letters once to
> construct the letters bag, then each word once to create the relevant word
> bag. After that it's (at worst) a couple of lookups and a comparison for
> each unique character in letters (for each word).
> 
> Your approach requires iteration over the words to create the lists of
> characters. Then they are (potentially) iterated over multiple times looking
> for the characters in letters. There's also the cost of removing items from
> arbitrary positions in the lists. Also, it seems that the character
> frequencies must be equal, and your approach only ensures that the words
> contain at least the required numbers of characters.
> 
> In computational terms, if the first approach is something like O(n+m) for n
> letters and words of length m, your algorithm is more like O(nm).
> Not to say that it will be slower for all possible letters and dictionaries,
> but probably for any realistic cases and a lot slower for large enough
> dictionaries.
> 
> Duncan
> 
> --MINE---
> 
> I appreciate your analysis. I have not looked at the "counter"
> implementation and suspect it does some similar loops within, though it may
> be implemented in a compiled language like C++.
> 

Before the introduction of counters I would have used a dict to create a
mapping of characters to counts. That would require iterating over the
characters in a string only once to create / update the dict entries,
and I doubt counters are implemented less efficiently than that.
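
Something like this, as a sketch (the ">=" test is the "at least as many"
reading; it would be "==" if the counts had to match exactly):

    def char_counts(s):
        # One pass over the string, updating dict entries as we go.
        counts = {}
        for c in s:
            counts[c] = counts.get(c, 0) + 1
        return counts

    def word_matches(word, letters):
        # The word qualifies if it needs no more of any character than
        # the letters supply.
        available = char_counts(letters)
        return all(available.get(c, 0) >= n
                   for c, n in char_counts(word).items())

    print(word_matches("attend", "attendance"))   # True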

> I did not write out my algorithm in Python but have done it for myself. It
> runs fast enough with most of the time spent in the slow I/O part. 
> 
> We can agree all algorithms have to read in all the words in a data file.
> There may be ways to store the data such as in a binary tree and even ways
> to thus prune the search as once a node is reached where all required
> letters have been found, all further words qualify below that point. If you
> match say "instant" then instants and instantiation would be deeper in the
> tree and also qualify assuming extra letters are allowed. 

I don't see how a binary tree would be useful. As I've pointed out in
another post, there are other data structures that could be useful. What
I had in mind was a trie (prefix tree). But it's only the distinct
characters and frequencies that are relevant and so I'd exploit that
(and one or two other things) to reduce space and improve search.
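
Not the trie itself, but as a rough illustration of exploiting the fact that
only the distinct characters and their frequencies matter: key the words by
that signature, so anagrams collapse onto one entry and each distinct
signature is tested only once.

    from collections import Counter, defaultdict

    def build_index(words):
        # Key each word by its character-frequency signature; anagrams
        # share the same key.
        index = defaultdict(list)
        for w in words:
            index[frozenset(Counter(w).items())].append(w)
        return index

    def find_words(index, letters):
        available = Counter(letters)
        hits = []
        for signature, words in index.items():
            if all(available[c] >= n for c, n in signature):
                hits.extend(words)
        return hits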

> We may differ on
> the requirement, as I think the number of repeats for something like
> a,t,t needs to be at least as big as in "attend", but that "attention", with
> yet another "t", would also be OK. If I am wrong, fine, but I note the person
> requesting this has admitted a certain lack of credentials while also
> claiming he made up a scenario just for fun. So this is not actually a
> particularly worthy challenge let alone with a purpose.
> 
> My impression is that the average word length would be something small like
> 5-7. The number of words in a dictionary might be 100,000 or more. So if you
> want efficiency, where do you get the bang for the buck? 
> 
> I would argue that a simple test run on all the words might often narrow the
> field to a much smaller number of answers like just a few thousand or even
> much less. Say you test for the presence of "aeiou" in words, in whatever
> order. That might be done while reading the file, filtering it down to relatively
> few potential answers. You can save those for a second round to determine if
> they are fully qualified by any additional rules that may involve more
> expensive operations.
> 

Your proposed approach didn't involve any trees (or tries) or filtering
of words. So I don't see how any of this justifies it.

> How fast (or slow) are regular expressions for this purpose? Obviously it
> depends on complexity and something like 
> "^[^aeiou]*[aeiou] [^aeiou]*[aeiou] [^aeiou]*[aeiou] [^aeiou]*[aeiou]
> [^aeiou]*[aeiou] [^aeiou]*$"
> 
> would be easy to construct once but likely horribly inefficient in searching
> and a bit of overkill here. I suspect there is already some simple C
> function that could be used from within python that looks like
> findall(choices, word) that might return how many of the letters in choices
> were found in word and you simply compare that to length(word) perhaps more
> efficiently.
> 
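
For what it's worth, the pattern only needs building and compiling once, and a
repetition count keeps it compact. Note that it tests for exactly five vowel
occurrences in any order, which is not quite the same as requiring each of
a, e, i, o and u:

    import re

    # Same pattern as above, written with a repetition count.
    FIVE_VOWELS = re.compile(r"[^aeiou]*(?:[aeiou][^aeiou]*){5}")

    print(bool(FIVE_VOWELS.fullmatch("education")))  # True  (five vowels)
    print(bool(FIVE_VOWELS.fullmatch("educated")))   # False (four vowels)
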
> It looks easier to check if a character exists in one of the ways already
> discussed within python using a loop as discussed. Something as simple as
> this:
> 
>   needed = "aeiou"
>   trying = "education"
>   found = all([trying.find(each) >= 0  for each in needed ])
>   print(found)
> 
>   trying = "educated"
>   found = all([trying.find(each) >= 0  for each in needed ])
>   print(found)
>
> The above prints True, then False.
> 
> My point is you can use the above to winnow down possible answers and only
> subject that smaller number to one of many tests including using a counter,
> making a dictionary you can query (albeit probably slower for fairly short
> words) and so on. 
> 

That still involves iterating over a word's characters repeatedly. And
that's just to identify if a word is a possible match.
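
A set does that presence test with a single pass over each word:

    needed = set("aeiou")

    # Build each word's character set once, then test for containment.
    print(needed <= set("education"))  # True
    print(needed <= set("educated"))   # False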

> Back to your other point, you suggest iterating over characters (I happened
> to do it in a list) multiple times can result in duplication as the same
> characters are searched repeatedly. Sure. I simply note most words are
> short. How much overhead is there searching for say nine characters five
> times? Compare that to say creating a dictionary or other data structure
> once, and then making a hash out of each letter being searched for?

Exactly. If the words are short enough, and removing items from
arbitrary positions in lists is quick enough, and dict creation and
lookups are sufficiently expensive - then it might be quicker. But we
have a reasonable idea which of these things are expensive and which are
quick. If in doubt we can time it.
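
For example (the list-removal version below is only my guess at the kind of
approach you describe, not your actual code):

    import timeit
    from collections import Counter

    LETTERS = "attendance"
    WORDS = ["attend", "tent", "dance", "intent", "cadence"] * 1000

    def counter_check():
        available = Counter(LETTERS)
        return [w for w in WORDS
                if all(available[c] >= n for c, n in Counter(w).items())]

    def list_removal_check():
        # Copy the letters into a list and cross characters off as matched.
        hits = []
        for w in WORDS:
            pool = list(LETTERS)
            for c in w:
                if c in pool:
                    pool.remove(c)
                else:
                    break
            else:
                hits.append(w)
        return hits

    print(timeit.timeit(counter_check, number=100))
    print(timeit.timeit(list_removal_check, number=100))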

> I can
> envision many possible ways to speed this up but many involve more use of
> memory or all other kinds of overhead such as initializing an object and
> calling various member functions and then deleting it or re-initializing it.
> My suggestion of removing items from a short list is not ideal and depending
> on how a bag was otherwise implemented, I could see it being better but
> again, how much is saved if you have a nine letter word that may contain
> nine unique letters or even have repeated letters as in banana? You end up
> saving a copy of the letter (or a hashed version) plus some integer value
> (perhaps just a byte) for the count. I would need to see the implementation
> and do some speed tests, ...
> 
> In real programming, there are many choices. This is a bit of a hypothetical
> but a part of me suggests the best answer for a quick prototype is something
> simple and not necessarily tuned for speed, just to show it can be done.
> Then, if the code is used often and is slow, sure, enhance it. Back in my
> UNIX days, I might have started with a rather inefficient pipeline for my
> example like:
> 
>   grep 'a' filename | grep 'e' | grep 'i' | grep 'o' | grep 'u' | ...
> 
> That would produce a list of candidates, again not efficiently. I might have
> switched to using better regular expressions, perhaps using sed instead, or
> AWK or even PERL. But if it had to be fast production code, I might have
> used C then C++ or whatever latest tool I had. 
> 
> These days, some things run so fast that we get lazy. When doing a pipelined
> approach in R or Python that analyzes data, I often effectively have to
> apply a set of conditions to lots of data, one after another. Do you trim
> the number of columns in a data frame first or do you apply a condition that
> trims the number of rows first? Clearly if you need 5 columns out of 666,
> then chances are you narrow the columns so there is less useless copying in
> later parts of the pipeline. But what if you have multiple conditions that
> each remove some rows and it depends on what the data looks like? For
> example, remove all rows in which one variable is NA and also all rows where
> another two variables are not equal to each other and so on. Isn't part of
> the consideration how expensive each comparison is? Imagine if you want to
> remove rows where a word in one column is not in the scrabble dictionary and
> your stupid function has to read it in each and every time! I would want
> that to be done last after cheaper methods removed say all words longer than
> 10 characters or that had uppercase or numeric parts or spaces or were NA.
> 
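
For illustration only (the frame and the column names here are invented), the
ordering argument in pandas terms is roughly: keep only the columns you need
first, apply the cheap row filters next, and leave the expensive membership
test until last.

    import pandas as pd

    def trim(df, scrabble_words):
        out = df[["word", "score"]]                   # narrow the columns first
        out = out.dropna(subset=["word"])             # cheap: drop NA rows
        out = out[out["word"].str.len() <= 10]        # cheap: drop long words
        out = out[out["word"].str.match(r"[a-z]+$")]  # cheap: lowercase letters only
        out = out[out["word"].isin(scrabble_words)]   # expensive check last
        return out
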
> So if the functionality in the topic above was really being built, and the
> functionality was being called repeatedly as some game progressed, you
> definitely might want to speed it up, if nothing else, by reading the
> dictionary file into memory once, perhaps into a data structure like a
> modified binary tree which could be searched with tail recursion or
> short-cuts. And in some cases, you might even want odd things like caching
> the results of earlier queries in a dictionary and skipping a full repeat
> search entirely.
> 
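
Caching is easy enough to sketch. If the cache key is the sorted letters, then
calls differing only in the order of the letters share an entry (WORDS below
just stands in for the dictionary file, read into memory once):

    from collections import Counter
    from functools import lru_cache

    WORDS = ("attend", "tent", "dance", "cadence")  # placeholder word list

    @lru_cache(maxsize=None)
    def words_for(letters_key):
        available = Counter(letters_key)
        return tuple(w for w in WORDS
                     if all(available[c] >= n for c, n in Counter(w).items()))

    def find(letters):
        # Sort the letters so "rat", "tar" and "art" hit the same cache entry.
        return words_for("".join(sorted(letters)))
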
> You might argue this situation always gets called with different arguments.
> Perhaps but imagine writing a program where the computer plays against a
> human. The computer might look at every open region of a scrabble board and
> take the tiles already in their "hand" and add zero or more tiles indicating
> characters already on the game board that it wanted to extend into or
> beyond. Say it sees the letter "r" and wonders if it could place all letters
> immediately to the left of that "r" and then to see if the "r" could be in
> multiple places within the word formed and finally if it could make a word
> starting with "r". That could be multiple identical calls, perhaps only
> differing in the order of the characters it is searching for. And it might
> repeat similar calls in another part of the board that has a dangling "r",
> perhaps the same one it did not use before. Obviously, the goal here would
> be to find all potentially possible words it could actually fit into that
> slot so further analysis would happen and it would investigate which method
> provided more points. It may even look ahead to see if that move allowed the
> opponent a better counter-move and so on into deeper analyses. Much of the
> expense of deciding what to actually do would be outside the functionality
> described. 
> 
> Admittedly, I see no reason for it to actually ask for these words as
> described, which could be longer; it would want only proper permutations of
> the letters.
> My point is I can envision places where caching earlier results would be
> potentially a great speedup. Heck, I suggest the problem could be turned
> around as the number of combinations might be computed and simply tested to
> see if they exist in a Python dictionary made from the external dictionary. 
> 
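
Turned around like that, it might look something like the following sketch:
build a set from the external dictionary once, then test each permutation of
each length for membership. That only stays feasible for small hands, since
the number of permutations grows factorially with the number of letters.

    from itertools import permutations

    def words_from_letters(letters, dictionary_words):
        valid = set(dictionary_words)   # built once from the external file
        found = set()
        for n in range(2, len(letters) + 1):
            for perm in permutations(letters, n):
                candidate = "".join(perm)
                if candidate in valid:
                    found.add(candidate)
        return found
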
> Heck, why even write the main routine in an interpreted language when you
> can link in a function written to be even more efficient?
> 
> But enhancements like that come with a cost. Would you pay a programmer an
> extra month to speed something like that up 5%? Maybe, but often not. And if
> you add in the typical costs in a large organization of various kinds of
> testing needed for a more complex or subtle approach, ...
> 
> Avi
> 

So the upshot of this is that the use of counters is premature
optimisation? I don't agree. I'm not even convinced that your proposed
approach is simpler. We need to compare the frequencies of characters,
and counters give us those frequencies directly. Then we compare them.
Premature optimisation might be "write a trie class, then ...".

I like to start with clean code that's not obviously inefficient and
that I might not have to revisit later (to tidy up or optimise). For me,
that would be the approach using counters.
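
Concretely, something like this; the comparison is written with ">=" here, and
would be "==" under the exact-frequency reading:

    from collections import Counter

    def matching_words(letters, words):
        available = Counter(letters)
        # Compare the character frequencies directly.
        return [w for w in words
                if all(available[c] >= n for c, n in Counter(w).items())]

    print(matching_words("attendance", ["attend", "intent", "dance"]))
    # ['attend', 'dance']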

Duncan

