Find word by given characters

Avi Gross avigross at verizon.net
Wed Nov 4 14:12:59 EST 2020


My comments at end:

-----Original Message-----
From: Python-list <python-list-bounces+avigross=verizon.net at python.org> On
Behalf Of duncan smith
Sent: Wednesday, November 4, 2020 1:09 PM
To: python-list at python.org
Subject: Re: Find word by given characters

On 04/11/2020 04:21, Avi Gross wrote:
> Duncan, my comments below yours at end.
> 
> ---YOURS---
> The Counter approach only requires iterating over the letters once to 
> construct the letters bag, then each word once to create the relevant 
> word bag. After that it's (at worst) a couple of lookups and a 
> comparison for each unique character in letters (for each word).
> 
> Your approach requires iteration over the words to create the lists of 
> characters. Then they are (potentially) iterated over multiple times 
> looking for the characters in letters. There's also the cost of 
> removing items from arbitrary positions in the lists. Also, it seems 
> that the character frequencies must be equal, and your approach only 
> ensures that the words contain at least the required numbers of
> characters.
> 
> In computational terms, if the first approach is something like O(n+m) 
> for n letters and words of length m, your algorithm is more like O(nm).
> Not to say that it will be slower for all possible letters and 
> dictionaries, but probably for any realistic cases and a lot slower 
> for large enough dictionaries.
> 
> Duncan
> 
> --MINE---
> 
> I appreciate your analysis. I have not looked at the "counter"
> implementation and suspect it does some similar loops within, albeit 
> it may be implemented in a compiled language like C++.
> 

Before the introduction of counters I would have used a dict to create a
mapping of characters to counts. That would require iterating over the
characters in a string only once to create / update the dict entries, and I
doubt counters are implemented less efficiently than that.
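
For illustration only (the word and variable names here are arbitrary), a
minimal sketch of counting characters with a plain dict versus a Counter:

    from collections import Counter

    word = "attention"

    # Plain dict: one pass over the characters, updating counts as we go.
    counts = {}
    for ch in word:
        counts[ch] = counts.get(ch, 0) + 1

    # Counter does the same single pass for us.
    assert counts == Counter(word)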

> I did not write out my algorithm in Python but have done it for 
> myself. It runs fast enough with most of the time spent in the slow I/O
> part.
> 
> We can agree all algorithms have to read in all the words in a data file.
> There may be ways to store the data such as in a binary tree and even 
> ways to thus prune the search as once a node is reached where all 
> required letters have been found, all further words qualify below that 
> point. If you match say "instant" then instants and instantiation 
> would be deeper in the tree and also qualify assuming extra letters are
> allowed.

I don't see how a binary tree would be useful. As I've pointed out in
another post, there are other data structures that could be useful. What I
had in mind was a trie (prefix tree). But it's only the distinct characters
and frequencies that are relevant and so I'd exploit that (and one or two
other things) to reduce space and improve search.
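
Just to illustrate the "distinct characters and frequencies" point (this is
not the trie itself, only the signature idea, and the structure below is
mine):

    from collections import Counter

    # Anagrams reduce to the same character bag, so they can share a single
    # entry in whatever structure is used.
    index = {}
    for word in ["listen", "silent", "enlist", "tea", "eat"]:
        signature = frozenset(Counter(word).items())
        index.setdefault(signature, []).append(word)

    for words in index.values():
        print(words)   # ['listen', 'silent', 'enlist'] then ['tea', 'eat']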

> We may differ on
> the requirement, as I think that the number of repeats for something
> like a,t,t needs to be at least as big as in "attend" but that
> "attention", with yet another "t", would also be OK. If I am wrong,
> fine, but I note the person requesting this has admitted a certain 
> lack of credentials while also claiming he made up a scenario just for 
> fun. So this is not actually a particularly worthy challenge let alone
> with a purpose.
> 
> My impression is that the average word length would be something small 
> like 5-7. The number of words in a dictionary might be 100,000 or 
> more. So if you want efficiency, where do you get the bang for the buck?
> 
> I would argue that a simple test run on all the words might often 
> narrow the field to a much smaller number of answers like just a few 
> thousand or even much less. Say you test for the presence of "aeiou" 
> in words, in whatever order. That might be done from reading a file 
> and filtering out a relatively few potential answers. You can save 
> those for a second round to determine if they are fully qualified by
> any additional rules that may involve more expensive operations.
> 

Your proposed approach didn't involve any trees (or tries) or filtering of
words. So I don't see how any of this justifies it.

> How fast (or slow) are regular expressions for this purpose? Obviously 
> it depends on complexity, and something like
> "^[^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*$"
> 
> would be easy to construct once but likely horribly inefficient in 
> searching and a bit of overkill here. I suspect there is already some 
> simple C function that could be used from within Python that looks
> like findall(choices, word) that might return how many of the letters
> in choices were found in word, and you simply compare that to the
> length of choices, perhaps more efficiently.
> 
> It looks easier to check if a character exists in one of the ways 
> already discussed within python using a loop as discussed. Something 
> as simple as
> this:
> 
>   needed = "aeiou"
>   trying = "education"
>   found = all([trying.find(each) >= 0 for each in needed])
>   print(found)
>
>   trying = "educated"
>   found = all([trying.find(each) >= 0 for each in needed])
>   print(found)
>
>   The above prints True and then False.
> 
> My point is you can use the above to winnow down possible answers and 
> only subject that smaller number to one of many tests including using 
> a counter, making a dictionary you can query (albeit probably slower 
> for fairly short
> words) and so on. 
> 

That still involves iterating over a word's characters repeatedly. And
that's just to identify if a word is a possible match.

> Back to your other point, you suggest iterating over characters (I
> happened to do it in a list) multiple times can result in duplication 
> as the same characters are searched repeatedly. Sure. I simply note 
> most words are short. How much overhead is there searching for say 
> nine characters five times? Compare that to say creating a dictionary 
> or other data structure once, and then making a hash out of each letter
> being searched for?

Exactly. If the words are short enough, and removing items from arbitrary
positions in lists is quick enough, and dict creation and lookups are
sufficiently expensive - then it might be quicker. But we have a reasonable
idea which of these things are expensive and which are quick. If in doubt we
can time it.
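
For example, something quick and dirty along these lines (the statements
being timed are only illustrative):

    import timeit

    setup = "from collections import Counter; word = 'attention'; needed = 'att'"

    # Counter-based check: does the word cover the needed letters?
    print(timeit.timeit("not Counter(needed) - Counter(word)", setup=setup))

    # Repeated scanning of the word, once per distinct needed character.
    print(timeit.timeit(
        "all(word.count(ch) >= needed.count(ch) for ch in set(needed))",
        setup=setup))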

> I can
> envision many possible ways to speed this up but many involve more use 
> of memory or all other kinds of overhead such as initializing an 
> object and calling various member functions and then deleting it or
> re-initializing it.
> My suggestion of removing items from a short list is not ideal and 
> depending on how a bag was otherwise implemented, I could see it being 
> better but again, how much is saved if you have a nine letter word 
> that may contain nine unique letters or even have repeated letters as 
> in banana? You end up saving a copy of the letter (or a hashed 
> version) plus some integer value (perhaps just a byte) for the count. 
> I would need to see the implementation and do some speed tests, ...
> 
> In real programming, there are many choices. This is a bit of a 
> hypothetical but a part of me suggests the best answer for a quick 
> prototype is something simple and not necessarily tuned for speed, just to
> show it can be done.
> Then, if the code is used often and is slow, sure, enhance it. Back in 
> my UNIX days, I might have started with a rather inefficient pipeline 
> for my example like:
> 
>   grep 'a' filename | grep 'e' | grep 'i' | grep 'o' | grep 'u' | ...
> 
> That would produce a list of candidates, again not efficiently. I 
> might have switched to using better regular expressions, perhaps using 
> sed instead, or AWK or even PERL. But if it had to be fast production 
> code, I might have used C then C++ or whatever latest tool I had.
> 
> These days, some things run so fast that we get lazy. When doing a 
> pipelined approach in R or Python that analyzes data, I often 
> effectively have to apply a set of conditions to lots of data, one 
> after another. Do you trim the number of columns in a data frame first 
> or do you apply a condition that trims the number of rows first? 
> Clearly if you need 5 columns out of 666, then chances are you narrow 
> the columns so there is less useless copying in later parts of the 
> pipeline. But what if you have multiple conditions that each remove
> some rows and it depends on what the data looks like? For example, 
> remove all rows in which one variable is NA and also all rows where 
> another two variables are not equal to each other and so on. Isn't 
> part of the consideration how expensive each comparison is? Imagine if 
> you want to remove rows where a word in one column is not in the 
> scrabble dictionary and your stupid function has to read it in each 
> and every time! I would want that to be done last after cheaper 
> methods removed say all words longer than
> 10 characters or that had uppercase or numeric parts or spaces or were NA.
> 
> So if the functionality in the  topic above was really being built, 
> and the functionality was being called repeatedly as some game 
> progressed, you definitely might want to speed it up, if nothing else, 
> by reading the dictionary file into memory once, perhaps into a data 
> structure like a modified binary tree which could be searched with 
> tail recursion or short-cuts. And in some cases, you might even want 
> odd things like to cache the results of earlier queries in a 
> dictionary and skip  a full repeat search entirely.
> 
> You might argue this situation always gets called with different
> arguments.
> Perhaps but imagine writing a program where the computer plays against 
> a human. The computer might look at every open region of a scrabble 
> board and take the tiles already in their "hand" and add zero or more 
> tiles indicating characters already on the game board that it wanted 
> to extend into or beyond. Say it sees the letter "r" and wonders if it 
> could place all letters immediately to the left of that "r" and then 
> to see if the "r" could be in multiple places within the word formed 
> and finally if it could make a word starting with "r". That could be 
> multiple identical calls, perhaps only differing in the order of the 
> characters it is searching for. And it might repeat similar calls in 
> another part of the board that has a dangling "r", perhaps the same 
> one it did not use before. Obviously, the goal here would be to find 
> all potentially possible words it could actually fit into that slot so 
> further analysis would happen and it would investigate which method 
> provided more points. It may even look ahead to see if that move 
> allowed the opponent a better counter-move and so on into deeper
> analyses. Much of the expense of deciding what to actually do would be
> outside the functionality described.
> 
> Admittedly, I see no reason for it to actually ask for these words as 
> described that could be longer, but only proper permutations of the
> letters.
> My point is I can envision places where caching earlier results would 
> be potentially a great speedup. Heck, I suggest the problem could be 
> turned around as the number of combinations might be computed and 
> simply tested to see if they exist in a Python dictionary made from the
> external dictionary.
> 
> Heck, why even write the main routine in an interpreted language when 
> you can link in a function written to be even more efficient?
> 
> But enhancements like that come with a cost. Would you pay a 
> programmer an extra month to speed something like that up 5%? Maybe, 
> but often not. And if you add in the typical costs in a large 
> organization of various kinds of testing needed for a more complex or
> subtle approach, ...
> 
> Avi
> 

So the upshot of this is that the use of counters is premature optimisation?
I don't agree. I'm not even convinced that your proposed approach is
simpler. We need to compare the frequencies of characters, and counters give
us those frequencies directly. Then we compare them.
Premature optimisation might be "write a trie class, then ...".

I like to start with clean code that's not obviously inefficient and that I
might not have to revisit later (to tidy up or optimise). For me, that would
be the approach using counters.
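
A sketch of the sort of thing I mean (whether the counts must match exactly
or just be covered was debated above, so the direction of the comparison
here is only one reading):

    from collections import Counter

    def find_words(words, letters):
        """Words whose characters can all be drawn from the given letters."""
        letters_bag = Counter(letters)
        return [w for w in words if not Counter(w) - letters_bag]

    print(find_words(["attend", "tea", "ant", "dazzle"], "attendx"))
    # ['attend', 'tea', 'ant']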

Duncan

----MY COMMENTS on 11/02/2020---

Duncan, my main point is an academic one, and I have no experience with, or
complaints about, the excellent suggestion of solving the problem using a
counter. Once a good implementation of something is available, reliable and
efficient, anyone who knows how to use it can use it to write decent code.

The person posting this "problem" seems to be learning by making up vague
projects for themselves. When I have taught programming, the emphasis has
been to learn the LANGUAGE first and how to do things in the core language
and perhaps the standard additions. Aspects like loops and lists are fairly
standard in Python. Arguably, so are many aspects of object-oriented and
functional programming these days. But generally (and
perhaps wrongly) we often are expected to learn how to solve problems using
the basics, often partially to learn how to compose a logical set of steps
as an algorithm. Later we may learn that much of what we need has already
been taken care of and you have modules or packages or libraries that often
let you work at a higher level.

So an approach that effectively uses an abstraction that counts everything
in one logical step and allows you to ask if one thing is a subset of
another is probably even better from a programming perspective. I am not
saying it is a premature optimization. It may turn out not to be optimal,
though, given a scenario where small enough words dominate. It is like
dictionaries in Python, where the implementation has to allocate space for a
hash table of some size that can grow. Given the nature of this problem,
there is an obvious way to make a data structure that is more compact and
probably faster.

Again, just as an academic thought experiment, what we require is the
ability to store info about no more than 26 lower-case letters of the
alphabet and to count how many of each appear. I mean, for now, assume we
are limited to ASCII, not all Unicode characters. So something as simple as
a vector of fixed length 26 would be enough, and the hash function would
simply subtract a fixed offset from the numeric value of the character. The
count values are very limited, as the longest scrabble word might be on the
order of 20 characters and no word is likely to have more than 5 or so
repetitions of any one character. So having a "counter" bag/dictionary may
be overkill in this scenario. As for the subset test, it can be a simple
vectorized operation: something that acts like a vector (perhaps using
numpy) can evaluate A >= B to get a vector of True/False values, and the
word qualifies if they are all True.
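
A hypothetical sketch of that fixed-length-26 idea (the function name and
the direction of the comparison are mine, purely to show the shape of it):

    import numpy as np

    def letter_counts(text):
        """Length-26 count vector for a lower-case ASCII string."""
        counts = np.zeros(26, dtype=np.int16)
        for ch in text:
            counts[ord(ch) - ord('a')] += 1
        return counts

    needed = letter_counts("att")
    for word in ["attend", "attention", "tea"]:
        # The word qualifies if it has at least as many of each needed letter.
        if np.all(letter_counts(word) >= needed):
            print(word)   # prints attend and then attention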

So, to be clear, I see lots of ways to program most things. Some are
arguably better for instructional purposes in a particular language and some
are easier on the programmer, some are more efficient, and some are
easier/harder to debug and so on. When I, or others, present an alternative,
it often does not mean existing proposals are wrong or even bad. 

Your points about how often characters or list items are scanned are valid.
But O(N)-type arguments must also consider the cost of each access, not just
how many times it happens. Also valid are your concerns about the cost of
removing list items at arbitrary positions. And, no, my original example did
not use exotic things like trees. But I have had many years of experience in many
languages and in each I develop a toolbag to work with. For longer lists
where I wanted to be able to insert or delete at random, I might use
something like linked lists that allow efficient changes without copying the
entire list repeatedly. Of course, that works better in languages that
support something like pointers. For short lists like in this example, the
overhead may not be worth it, and a bit of copying may be as fast or faster
and not worth worrying about.

I think I have said more than enough and won't continue this thread. There
are many fairly decent ways of doing things, and this particular problem is
not particularly interesting as I see little actual reason to build it. But
having said that, I note lots of apps for my phone and elsewhere that are
games for making words in many different ways, often by permuting a set of
letters you can use. Underneath the hood, I suspect they have a dictionary
they have to consult, and wonder how they implement their searches.