Find word by given characters

Avi Gross avigross at verizon.net
Tue Nov 3 23:21:43 EST 2020


Duncan, my comments are below yours, at the end.

---YOURS---
The Counter approach only requires iterating over the letters once to
construct the letters bag, then each word once to create the relevant word
bag. After that it's (at worst) a couple of lookups and a comparison for
each unique character in letters (for each word).

Your approach requires iteration over the words to create the lists of
characters. Then they are (potentially) iterated over multiple times looking
for the characters in letters. There's also the cost of removing items from
arbitrary positions in the lists. Also, it seems that the character
frequencies must be equal, and your approach only ensures that the words
contain at least the required numbers of characters.

In computational terms, if the first approach is something like O(n+m) for n
letters and words of length m, your algorithm is more like O(nm).
Not to say that it will be slower for all possible letters and dictionaries,
but probably for any realistic cases and a lot slower for large enough
dictionaries.

Duncan

--MINE---

I appreciate your analysis. I have not looked at the Counter
implementation, and I suspect it does some similar loops within, albeit
perhaps implemented in a compiled language like C++.
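
For what it is worth, a minimal sketch of the Counter approach you
describe, under my reading that repeats in a word may exceed those
requested, might look like this (the function name and word list are
made up for illustration):

  from collections import Counter

  def matching_words(letters, words):
      need = Counter(letters)          # bag of required letters, built once
      result = []
      for word in words:
          have = Counter(word)         # one bag per candidate word
          # Keep the word if it has at least the required count of
          # every unique letter asked for.
          if all(have[ch] >= n for ch, n in need.items()):
              result.append(word)
      return result

  print(matching_words("att", ["attend", "attention", "table", "banana"]))
  # ['attend', 'attention']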

I did not post my own algorithm written out in Python, but I have done it
for myself. It runs fast enough, with most of the time spent in the slow
I/O part.

We can agree that all algorithms have to read in all the words in a data
file. There may be ways to store the data, such as in a binary tree, and
even ways to prune the search: once a node is reached where all required
letters have been found, every word below that point qualifies. If you
match, say, "instant", then "instants" and "instantiation" would be deeper
in the tree and also qualify, assuming extra letters are allowed. We may
differ on the requirement, as I think the number of repeats for something
like a,t,t is required to be at least as large as in "attend", but that
"attention", with yet another "t", would also be OK. If I am wrong, fine,
but I note that the person requesting this has admitted a certain lack of
credentials while also claiming he made up the scenario just for fun. So
this is not actually a particularly worthy challenge, let alone one with a
purpose.
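
To make the pruning idea concrete, here is a rough sketch under a couple
of assumptions I am making: the tree is a simple prefix tree (one concrete
way to arrange such a tree), and a word must contain at least the
requested count of each letter. The class and function names and the tiny
word list are invented for illustration:

  from collections import Counter

  class Node:
      def __init__(self):
          self.children = {}      # letter -> child Node
          self.is_word = False    # a word ends at this node

  def add_word(root, word):
      node = root
      for ch in word:
          node = node.children.setdefault(ch, Node())
      node.is_word = True

  def collect(node, prefix, out):
      # All required letters were seen on the path to here, so every
      # word anywhere below this node qualifies -- no more testing.
      if node.is_word:
          out.append(prefix)
      for ch, child in node.children.items():
          collect(child, prefix + ch, out)

  def search(node, prefix, still_needed, out):
      if not still_needed:
          collect(node, prefix, out)   # prune: stop checking letters
          return
      for ch, child in node.children.items():
          search(child, prefix + ch, still_needed - Counter(ch), out)

  root = Node()
  for w in ("instant", "instants", "instantiation", "instead"):
      add_word(root, w)
  found = []
  search(root, "", Counter("att"), found)
  print(found)   # ['instant', 'instants', 'instantiation']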

My impression is that the average word length would be something small like
5-7. The number of words in a dictionary might be 100,000 or more. So if you
want efficiency, where do you get the bang for the buck? 

I would argue that a simple test run on all the words might often narrow
the field to a much smaller number of answers, like just a few thousand or
even far fewer. Say you test for the presence of "aeiou" in words, in
whatever order. That might be done while reading the file, filtering it
down to relatively few potential answers. You can save those for a second
round to determine whether they are fully qualified by any additional
rules that may involve more expensive operations.

How fast (or slow) are regular expressions for this purpose? Obviously it
depends on complexity and something like 
"^[^aeiou]*[aeiou] [^aeiou]*[aeiou] [^aeiou]*[aeiou] [^aeiou]*[aeiou]
[^aeiou]*[aeiou] [^aeiou]*$"

would be easy to construct once but likely horribly inefficient in
searching, and a bit of overkill here. I suspect there is already some
simple C function that could be used from within Python, looking something
like findall(choices, word), that might return how many of the letters in
choices were found in word; you could then simply compare that to the
length of choices, perhaps more efficiently.
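
Something along the lines of the helper I am imagining, done with
re.findall (this version only tests the presence of each letter and
ignores repeat counts; the name count_found is made up):

  import re

  def count_found(choices, word):
      # How many distinct letters from choices appear anywhere in word.
      pattern = "[" + re.escape(choices) + "]"
      return len(set(re.findall(pattern, word)))

  needed = "aeiou"
  for word in ("education", "educated"):
      print(word, count_found(needed, word) == len(set(needed)))
  # education True
  # educated False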

It looks easier to check whether each character exists using one of the
approaches already discussed within Python, with a loop. Something as
simple as this:

  needed = "aeiou"
  trying = "education"
  found = all([trying.find(each) >= 0  for each in needed ])
  print(found)

  trying = "educated"
  found = all([trying.find(each) >= 0  for each in needed ])
  print(found)
The above prints True and then False.

My point is that you can use the above to winnow down the possible answers
and only subject that smaller number to one of many tests, including using
a Counter, making a dictionary you can query (albeit probably slower for
fairly short words), and so on.
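
A tiny illustration of that winnowing, using a made-up requirement of
a,t,t and a made-up word list:

  from collections import Counter

  needed = "att"
  words = ["attend", "attention", "table", "banana"]

  # Round one: cheap presence test to winnow the field.
  candidates = [w for w in words if all(ch in w for ch in needed)]

  # Round two: a stricter, count-aware test on the survivors only.
  need = Counter(needed)
  final = [w for w in candidates
           if all(Counter(w)[ch] >= n for ch, n in need.items())]

  print(candidates)   # ['attend', 'attention', 'table']
  print(final)        # ['attend', 'attention']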

Back to your other point: you suggest that iterating over characters (I
happened to do it in a list) multiple times can result in duplication, as
the same characters are searched repeatedly. Sure. I simply note that most
words are short. How much overhead is there in searching, say, nine
characters five times? Compare that to, say, creating a dictionary or
other data structure once and then hashing each letter being searched for.
I can envision many possible ways to speed this up, but many involve more
use of memory or other kinds of overhead, such as initializing an object,
calling various member functions, and then deleting or re-initializing it.
My suggestion of removing items from a short list is not ideal, and
depending on how a bag was otherwise implemented, I could see it being
better, but again, how much is saved if you have a nine-letter word that
may contain nine unique letters or repeated letters as in "banana"? You
end up saving a copy of the letter (or a hashed version) plus some integer
value (perhaps just a byte) for the count. I would need to see the
implementation and do some speed tests, ...
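
If I did do speed tests, a quick and admittedly unscientific comparison of
a repeated-scan check against a bag-of-counts check on one short word
might start out like this:

  import timeit
  from collections import Counter

  word, needed = "attention", "att"

  def by_scan():
      # Scan the short word once per required letter.
      return all(ch in word for ch in needed)

  def by_bag():
      # Build count bags and compare them.
      need, have = Counter(needed), Counter(word)
      return all(have[ch] >= n for ch, n in need.items())

  for fn in (by_scan, by_bag):
      print(fn.__name__, timeit.timeit(fn, number=100_000))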

In real programming, there are many choices. This is a bit of a hypothetical
but a part of me suggests the best answer for a quick prototype is something
simple and not necessarily tuned for speed, just to show it can be done.
Then, if the code is used often and is slow, sure, enhance it. Back in my
UNIX days, I might have started with a rather inefficient pipeline for my
example like:

  grep 'a' filename | grep 'e' | grep 'i' | grep 'o' | grep 'u' | ...

That would produce a list of candidates, again not efficiently. I might have
switched to using better regular expressions, perhaps using sed instead,
or AWK, or even Perl. But if it had to be fast production code, I might
have used C, then C++, or whatever the latest tool I had happened to be.

These days, some things run so fast that we get lazy. When doing a pipelined
approach in R or Python that analyzes data, I often effectively have to
apply a set of conditions to lots of data, one after another. Do you trim
the number of columns in a data frame first or do you apply a condition that
trims the number of rows first? Clearly if you need 5 columns out of 666,
then chances are you narrow the columns so there is less useless copying in
later parts of the pipeline. . But what if you have multiple conditions that
each remove some rows and it depends on what the data looks like? For
example, remove all rows in which one variable is NA and also all rows where
another two variables are not equal to each other and so on. Isn't part of
the consideration how expensive each comparison is? Imagine if you want to
remove rows where a word in one column is not in the scrabble dictionary and
your stupid function has to read it in each and every time! I would want
that to be done last after cheaper methods removed say all words longer than
10 characters or that had uppercase or numeric parts or spaces or were NA.
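
In Python terms, the ordering I mean might look something like this toy
example, with a hypothetical pandas frame and a small set standing in for
the expensive scrabble-dictionary check:

  import pandas as pd

  # Hypothetical data and word list, just to show the ordering idea.
  df = pd.DataFrame({"word": ["Banana", "attend", None, "attention"],
                     "score": [3, 8, 0, 11],
                     "junk": [1, 2, 3, 4]})
  scrabble_words = {"attend", "attention"}   # stand-in for a costly lookup

  df = df[["word", "score"]]                  # narrow the columns first
  df = df[df["word"].notna()]                 # then the cheap row filters
  df = df[df["word"].str.len() <= 10]
  df = df[df["word"].str.islower()]
  df = df[df["word"].isin(scrabble_words)]    # the expensive test last
  print(df)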

So if the functionality in the topic above were really being built, and it
was being called repeatedly as some game progressed, you might well want
to speed it up, if nothing else by reading the dictionary file into memory
once, perhaps into a data structure like a modified binary tree that could
be searched with tail recursion or short-cuts. And in some cases you might
even want odd things, like caching the results of earlier queries in a
dictionary and skipping a full repeat search entirely.
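
A cheap way to get both of those behaviours in Python might be
functools.lru_cache over a word list read into memory once (the tiny word
list and the function names here are stand-ins):

  from functools import lru_cache

  # Stand-in for a dictionary file read into memory exactly once.
  WORDS = ("attend", "attention", "banana", "rat", "tar", "art")

  @lru_cache(maxsize=None)
  def _search(key):
      return tuple(w for w in WORDS if all(ch in w for ch in key))

  def words_containing(letters):
      # Sort the letters so "rat", "tar" and "art" share one cache entry.
      return _search("".join(sorted(letters)))

  print(words_containing("rat"))
  print(words_containing("tar"))    # answered from the cache
  print(_search.cache_info())       # hits=1, misses=1, ...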

You might argue this situation always gets called with different arguments.
Perhaps, but imagine writing a program where the computer plays against a
human. The computer might look at every open region of a scrabble board,
take the tiles already in its "hand", and add zero or more tiles
representing characters already on the game board that it wanted to extend
into or beyond. Say it sees the letter "r" and wonders whether it could
place all its letters immediately to the left of that "r", then whether
the "r" could sit in multiple places within the word formed, and finally
whether it could make a word starting with "r". That could be multiple
identical calls, perhaps only
differing in the order of the characters it is searching for. And it might
repeat similar calls in another part of the board that has a dangling "r",
perhaps the same one it did not use before. Obviously, the goal here would
be to find all potentially possible words it could actually fit into that
slot so further analysis would happen and it would investigate which method
provided more points. It may even look ahead to see if that move allowed the
opponent a better counter-move, and so on into deeper analyses. Much of the
expense of deciding what to actually do would be outside the functionality
described. 

Admittedly, I see no reason for it to actually ask for words that could be
longer, as described above; it would want only proper permutations of the
letters. My point is that I can envision places where caching earlier
results would potentially be a great speedup. Heck, I suggest the problem
could even be turned around: the combinations might be computed and simply
tested to see whether they exist in a Python dictionary made from the
external dictionary.
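
A sketch of that turned-around approach, generating arrangements of the
letters (permutations, since spelling order matters) and testing
membership in a set built from the external dictionary (the tiny
dictionary here is made up):

  from itertools import permutations

  # Hypothetical dictionary loaded once into a set for fast lookups.
  dictionary = {"rat", "tar", "art", "attend"}
  letters = "rta"

  found = {"".join(p)
           for n in range(2, len(letters) + 1)
           for p in permutations(letters, n)
           if "".join(p) in dictionary}
  print(found)   # {'art', 'rat', 'tar'} in some order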

Heck, why even write the main routine in an interpreted language when you
can link in a function written to be even more efficient?

But enhancements like that come with a cost. Would you pay a programmer an
extra month to speed something like that up 5%? Maybe, but often not. And if
you add in the typical costs in a large organization of various kinds of
testing needed for a more complex or subtle approach, ...

Avi


