Find word by given characters

duncan smith duncan at invalid.invalid
Wed Nov 4 15:03:31 EST 2020


On 04/11/2020 19:12, Avi Gross wrote:
> My comments at end:
> 
> -----Original Message-----
> From: Python-list <python-list-bounces+avigross=verizon.net at python.org> On
> Behalf Of duncan smith
> Sent: Wednesday, November 4, 2020 1:09 PM
> To: python-list at python.org
> Subject: Re: Find word by given characters
> 
> On 04/11/2020 04:21, Avi Gross wrote:
>> Duncan, my comments below yours at end.
>>
>> ---YOURS---
>> The Counter approach only requires iterating over the letters once to 
>> construct the letters bag, then each word once to create the relevant 
>> word bag. After that it's (at worst) a couple of lookups and a 
>> comparison for each unique character in letters (for each word).
>>
>> Your approach requires iteration over the words to create the lists of 
>> characters. Then they are (potentially) iterated over multiple times 
>> looking for the characters in letters. There's also the cost of 
>> removing items from arbitrary positions in the lists. Also, it seems 
>> that the character frequencies must be equal, and your approach only 
>> ensures that the words contain at least the required numbers of
>> characters.
>>
>> In computational terms, if the first approach is something like O(n+m) 
>> for n letters and words of length m, your algorithm is more like O(nm).
>> Not to say that it will be slower for all possible letters and 
>> dictionaries, but probably for any realistic cases and a lot slower 
>> for large enough dictionaries.
>>
>> Duncan
>>
>> --MINE---
>>
>> I appreciate your analysis. I have not looked at the "counter"
>> implementation and suspect it does some similar loops within, though
>> it may be implemented in a compiled language like C++.
>>
> 
> Before the introduction of counters I would have used a dict to create a
> mapping of characters to counts. That would require iterating over the
> characters in a string only once to create / update the dict entries, and I
> doubt counters are implemented less efficiently than that.
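>
> For instance, a rough sketch of that dict-based counting (untested, and the
> helper names are mine):
>
>   def char_counts(s):
>       # Map each character to the number of times it occurs,
>       # iterating over the string exactly once.
>       counts = {}
>       for c in s:
>           counts[c] = counts.get(c, 0) + 1
>       return counts
>
>   def word_fits(letters, word):
>       # True if 'word' can be built from 'letters' without using any
>       # character more often than it is available.
>       available = char_counts(letters)
>       needed = char_counts(word)
>       return all(available.get(c, 0) >= n for c, n in needed.items())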
> 
>> I did not write out my algorithm in Python but have done it for 
>> myself. It runs fast enough with most of the time spent in the slow I/O
>> part.
>>
>> We can agree all algorithms have to read in all the words in a data file.
>> There may be ways to store the data, such as in a binary tree, and even
>> ways to prune the search: once a node is reached where all the
>> required letters have been found, all further words below that
>> point qualify. If you match, say, "instant", then "instants" and
>> "instantiation" would be deeper in the tree and also qualify, assuming
>> extra letters are allowed.
> 
> I don't see how a binary tree would be useful. As I've pointed out in
> another post, there are other data structures that could be useful. What I
> had in mind was a trie (prefix tree). But it's only the distinct characters
> and frequencies that are relevant and so I'd exploit that (and one or two
> other things) to reduce space and improve search.
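>
> For reference, a minimal prefix-tree sketch (plain word prefixes only, not
> the distinct-character/frequency variant described above; the names are
> illustrative):
>
>   def trie_insert(root, word):
>       # root is a nested dict; each level maps a character to a sub-trie.
>       node = root
>       for c in word:
>           node = node.setdefault(c, {})
>       node["$"] = True  # sentinel marking the end of a complete word
>
>   def trie_contains(root, word):
>       node = root
>       for c in word:
>           if c not in node:
>               return False
>           node = node[c]
>       return "$" in node
>
>   root = {}
>   for w in ("instant", "instants", "instantiation"):
>       trie_insert(root, w)
>   print(trie_contains(root, "instants"))  # True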
> 
>> We may differ on the requirement, as I think that the number of repeats
>> for something like a,t,t needs to be at least as large as in "attend" but that
>> "attention" with yet another "t" would also be OK. If I am wrong, 
>> fine, but I note the person requesting this has admitted a certain 
>> lack of credentials while also claiming he made up a scenario just for 
>> fun. So this is not actually a particularly worthy challenge let alone
>> with a purpose.
>>
>> My impression is that the average word length would be something small 
>> like 5-7. The number of words in a dictionary might be 100,000 or 
>> more. So if you want efficiency, where do you get the bang for the buck?
>>
>> I would argue that a simple test run on all the words might often 
>> narrow the field to a much smaller number of answers like just a few 
>> thousand or even much less. Say you test for the presence of "aeiou" 
>> in words, in whatever order. That might be done from reading a file 
>> and filtering out a relatively few potential answers. You can save 
>> those for a second round to determine if they are fully qualified by
>> any additional rules that may involve more expensive operations.
>>
> 
> Your proposed approach didn't involve any trees (or tries) or filtering of
> words. So I don't see how any of this justifies it.
> 
>> How fast (or slow) are regular expressions for this purpose? Obviously
>> it depends on complexity, and something like
>>
>>   "^[^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*$"
>>
>> would be easy to construct once but likely horribly inefficient in 
>> searching and a bit of overkill here. I suspect there is already some 
>> simple C function that could be used from within Python that looks
>> like findall(choices, word) that might return how many of the letters
>> in choices were found in word, and you simply compare that to
>> length(word), perhaps more efficiently.
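>>
>> (For what it is worth, re.findall with a character class gets close to
>> that hypothetical findall -- a rough sketch:)
>>
>>   import re
>>
>>   choices = "aeiou"
>>   word = "education"
>>   # Count how many characters of word fall in choices (repeats included).
>>   hits = len(re.findall("[" + choices + "]", word))
>>   print(hits)  # 5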
>>
>> It looks easier to check whether a character exists in one of the ways
>> already discussed within Python, using a loop. Something as simple as
>> this:
>>
>>   needed = "aeiou"
>>   trying = "education"
>>   found = all([trying.find(each) >= 0  for each in needed ])
>>   print(found)
>>
>>   trying = "educated"
>>   found = all([trying.find(each) >= 0  for each in needed ])
>>   print(found)
>>
>> The above prints True and then False.
>>
>> My point is you can use the above to winnow down possible answers and 
>> only subject that smaller number to one of many tests including using 
>> a counter, making a dictionary you can query (albeit probably slower 
>> for fairly short
>> words) and so on. 
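>>
>> A rough sketch of that two-stage winnowing (untested; WORDS stands in for
>> the word list read from the file):
>>
>>   from collections import Counter
>>
>>   letters = "education"
>>   needed = Counter(letters)
>>   WORDS = ["cat", "action", "aunt", "idea"]  # placeholder word list
>>
>>   # Cheap first pass: keep words whose characters all appear in letters.
>>   candidates = [w for w in WORDS if all(c in letters for c in w)]
>>
>>   # Careful second pass: respect how many of each character is available.
>>   matches = [w for w in candidates if not (Counter(w) - needed)]
>>   print(matches)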
>>
> 
> That still involves iterating over a word's characters repeatedly. And
> that's just to identify if a word is a possible match.
> 
>> Back to your other point: you suggest iterating over characters (I
>> happened to do it in a list) multiple times can result in duplication,
>> as the same characters are searched repeatedly. Sure. I simply note that
>> most words are short. How much overhead is there in searching, say,
>> nine characters five times? Compare that to creating a dictionary
>> or other data structure once and then hashing each letter
>> being searched for.
> 
> Exactly. If the words are short enough, and removing items from arbitrary
> positions in lists is quick enough, and dict creation and lookups are
> sufficiently expensive - then it might be quicker. But we have a reasonable
> idea which of these things are expensive and which are quick. If in doubt we
> can time it.
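>
> (If anyone does want to time it, something along these lines would do -- a
> sketch with made-up inputs:)
>
>   import timeit
>   from collections import Counter
>
>   letters = "attendant"
>   word = "attend"
>
>   def counter_test():
>       # Word fits if it needs no character more often than letters has it.
>       return not (Counter(word) - Counter(letters))
>
>   def list_test():
>       # Remove each needed character from a throwaway copy of the letters.
>       pool = list(letters)
>       for c in word:
>           if c in pool:
>               pool.remove(c)
>           else:
>               return False
>       return True
>
>   print(timeit.timeit(counter_test, number=100_000))
>   print(timeit.timeit(list_test, number=100_000))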
> 
>> I can envision many possible ways to speed this up but many involve more use
>> of memory or all other kinds of overhead such as initializing an 
>> object and calling various member functions and then deleting it or
>> re-initializing it.
>> My suggestion of removing items from a short list is not ideal and 
>> depending on how a bag was otherwise implemented, I could see it being 
>> better but again, how much is saved if you have a nine letter word 
>> that may contain nine unique letters or even have repeated letters as 
>> in banana? You end up saving a copy of the letter (or a hashed 
>> version) plus some integer value (perhaps just a byte) for the count. 
>> I would need to see the implementation and do some speed tests, ...
>>
>> In real programming, there are many choices. This is a bit of a 
>> hypothetical but a part of me suggests the best answer for a quick 
>> prototype is something simple and not necessarily tuned for speed, just to
>> show it can be done.
>> Then, if the code is used often and is slow, sure, enhance it. Back in 
>> my UNIX days, I might have started with a rather inefficient pipeline 
>> for my example like:
>>
>>   grep 'a' filename | grep 'e' | grep 'i' | grep 'o' | grep 'u' | ...
>>
>> That would produce a list of candidates, again not efficiently. I 
>> might have switched to using better regular expressions, perhaps using 
>> sed instead, or AWK or even PERL. But if it had to be fast production 
>> code, I might have used C then C++ or whatever latest tool I had.
>>
>> These days, some things run so fast that we get lazy. When doing a 
>> pipelined approach in R or Python that analyzes data, I often 
>> effectively have to apply a set of conditions to lots of data, one 
>> after another. Do you trim the number of columns in a data frame first 
>> or do you apply a condition that trims the number of rows first? 
>> Clearly if you need 5 columns out of 666, then chances are you narrow 
>> the columns so there is less useless copying in later parts of the 
>> pipeline. But what if you have multiple conditions that each remove
>> some rows and it depends on what the data looks like? For example, 
>> remove all rows in which one variable is NA and also all rows where 
>> another two variables are not equal to each other and so on. Isn't 
>> part of the consideration how expensive each comparison is? Imagine if 
>> you want to remove rows where a word in one column is not in the 
>> scrabble dictionary and your stupid function has to read it in each 
>> and every time! I would want that to be done last after cheaper 
>> methods removed say all words longer than
>> 10 characters or that had uppercase or numeric parts or spaces or were NA.
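>>
>> In Python terms that ordering might look roughly like this (a sketch;
>> scrabble_words stands in for the real dictionary):
>>
>>   words = ["attend", "Attend!", "instantiation", "banana", "hyperbole"]
>>   scrabble_words = {"attend", "banana"}  # placeholder for the real set
>>
>>   # Cheap tests first; the short-circuiting 'and' means the expensive
>>   # dictionary test only runs on words that pass the cheap ones.
>>   survivors = [w for w in words
>>                if len(w) <= 10
>>                and w.isalpha()
>>                and w.islower()
>>                and w in scrabble_words]
>>   print(survivors)  # ['attend', 'banana']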
>>
>> So if the functionality in the  topic above was really being built, 
>> and the functionality was being called repeatedly as some game 
>> progressed, you definitely might want to speed it up, if nothing else, 
>> by reading the dictionary file into memory once, perhaps into a data 
>> structure like a modified binary tree which could be searched with 
>> tail recursion or short-cuts. And in some cases, you might even want 
>> odd things like to cache the results of earlier queries in a 
>> dictionary and skip  a full repeat search entirely.
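>>
>> A caching sketch along those lines (illustrative only; the search argument
>> stands in for whatever full lookup is being wrapped):
>>
>>   _cache = {}
>>
>>   def cached_search(letters, search):
>>       # Key on the sorted letters so "tar", "rat" and "art" share an entry.
>>       key = "".join(sorted(letters))
>>       if key not in _cache:
>>           _cache[key] = search(key)  # full search only on a cache miss
>>       return _cache[key]
>>
>>   print(cached_search("rat", lambda s: ["demo result for " + s]))
>>   print(cached_search("tar", lambda s: ["demo result for " + s]))  # hit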
>>
>> You might argue this situation always gets called with different
>> arguments.
>> Perhaps but imagine writing a program where the computer plays against 
>> a human. The computer might look at every open region of a scrabble 
>> board and take the tiles already in their "hand" and add zero or more 
>> tiles indicating characters already on the game board that it wanted 
>> to extend into or beyond. Say it sees the letter "r" and wonders if it 
>> could place all letters immediately to the left of that "r" and then 
>> to see if the "r" could be in multiple places within the word formed 
>> and finally if it could make a word starting with "r". That could be 
>> multiple identical calls, perhaps only differing in the order of the 
>> characters it is searching for. And it might repeat similar calls in 
>> another part of the board that has a dangling "r", perhaps the same 
>> one it did not use before. Obviously, the goal here would be to find 
>> all potentially possible words it could actually fit into that slot so 
>> further analysis would happen and it would investigate which method 
>> provided more points. It may even look ahead to see if that move 
>> allowed the opponent a better-countermove and so on into deeper 
>> analyses.  Much of the expense of deciding what to actually do would be
>> outside the functionality described.
>>
>> Admittedly, I see no reason for it to actually ask for the longer words
>> described above; it would only want proper permutations of the letters.
>> My point is I can envision places where caching earlier results would 
>> be potentially a great speedup. Heck, I suggest the problem could be 
>> turned around as the number of combinations might be computed and 
>> simply tested to see if they exist in a Python dictionary made from the
>> external dictionary.
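>>
>> Roughly, that turned-around approach might look like this (a sketch; fine
>> for a small rack of letters, but it explodes combinatorially for long ones):
>>
>>   from itertools import permutations
>>
>>   dictionary = {"tar", "rat", "art", "tarp"}  # placeholder word set
>>   letters = "rat"
>>
>>   # Generate every arrangement of every length and keep those that are words.
>>   found = set()
>>   for n in range(2, len(letters) + 1):
>>       for p in permutations(letters, n):
>>           word = "".join(p)
>>           if word in dictionary:
>>               found.add(word)
>>   print(found)  # the three words art, rat and tar (set order varies)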
>>
>> Heck, why even write the main routine in an interpreted language when 
>> you can link in a function written to be even more efficient?
>>
>> But enhancements like that come with a cost. Would you pay a 
>> programmer an extra month to speed something like that up 5%? Maybe, 
>> but often not. And if you add in the typical costs in a large 
>> organization of various kinds of testing needed for a more complex or
>> subtle approach, ...
>>
>> Avi
>>
> 
> So the upshot of this is that the use of counters is premature optimisation?
> I don't agree. I'm not even convinced that your proposed approach is
> simpler. We need to compare the frequencies of characters, and counters give
> us those frequencies directly. Then we compare them.
> Premature optimisation might be "write a trie class, then ...".
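>
> Concretely, a sketch of that comparison (here a word qualifies if it uses
> no character more often than it is available; 'words' is a placeholder
> list):
>
>   from collections import Counter
>
>   letters = "attendant"
>   words = ["attend", "tent", "banana"]  # placeholder word list
>
>   available = Counter(letters)
>   matches = [w for w in words if not (Counter(w) - available)]
>   print(matches)  # ['attend', 'tent']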
> 
> I like to start with clean code that's not obviously inefficient and that I
> might not have to revisit later (to tidy up or optimise). For me, that would
> be the approach using counters.
> 
> Duncan
> 
> ----MY COMMENTS on 11/02/2020---
> 
> Duncan, my main point is academic, and I have no experience with, or
> complaints about, the excellent suggestion of solving this problem using a
> counter. Once a good implementation of something is available, reliable,
> and efficient, anyone who knows how to use it can use it to write decent
> code.
> 
> The person posting this "problem" seems to be learning by making up vague
> projects for themselves. When I have taught programming, the emphasis has
> been to learn the LANGUAGE first and how to do things in the core language
> and perhaps standard additions. Aspects like loops and lists are fairly
> standard in Python. Arguably you could suggest so are many aspects of
> object-oriented and functional programming these days. But generally (and
> perhaps wrongly) we often are expected to learn how to solve problems using
> the basics, often partially to learn how to compose a logical set of steps
> as an algorithm. Later we may learn that much of what we need has already
> been taken care of and you have modules or packages or libraries that often
> let you work at a higher level.
> 
> So an approach that effectively uses an abstraction that counts everything
> in one logical step and allows you to ask if one thing is a subset of
> another is probably even better from a programming perspective. I am not
> saying it is a premature optimization. It may turn out to not be optimal
> though given the scenario where small enough words dominate. It is like with
> dictionaries in Python where the inner method has to allocate space for a
> hash table of some size that can grow.  Given the nature of this problem,
> there is an obvious way to make a data structure that is more compact and
> probably faster.
> 
> Again, just as an academic thought experiment, what we require is the
> ability to store info about no more than 26 lower-case letters of the
> alphabet and count how many. I mean, for now, assume we are limited to
> ASCII, not all Unicode characters. So something as simple as a vector of
> fixed length 26 would be enough and the hash function would simply subtract
> a fixed number offset from the value of the character as a byte of 8 bits.
> The count value is very limited as the longest scrabble word might be on the
> order of length 20 and no word is likely to have more than 5 or so
> repetitions of any one character. So having a "counter" bag/dictionary may
> be overkill in this scenario. As for the subset test, it can be a simple
> vectorized operation: something that acts like a vector (perhaps using
> numpy) can compute A >= B to get a vector of True/False values, and the
> word qualifies if they are all True.
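>
> As a sketch of that fixed-vector idea (plain lists rather than numpy, just
> to show the shape of it; assumes lower-case ASCII input):
>
>   def letter_vector(s):
>       # 26 counts, indexed by ord(c) - ord('a').
>       v = [0] * 26
>       for c in s:
>           v[ord(c) - ord("a")] += 1
>       return v
>
>   def covers(available, word):
>       # Element-wise A >= B: every count the word needs is available.
>       return all(a >= b for a, b in zip(available, letter_vector(word)))
>
>   available = letter_vector("attendant")
>   print(covers(available, "attend"))     # True
>   print(covers(available, "attention"))  # False (no 'i' or 'o' available)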
> 
> So, to be clear, I see lots of ways to program most things. Some are
> arguably better for instructional purposes in a particular language and some
> are easier on the programmer and some are more efficient  and some are
> easier/harder to debug and so on. When I, or others, present an alternative,
> it often does not mean existing proposals are wrong or even bad. 
> 
> Your points about how often characters or list items are scanned are valid.
> But O(N)-type arguments must also consider the cost of each access, not just
> how many accesses there are. Also valid are your concerns about the cost of
> removing list items at arbitrary positions. And, no, my original example
> did not use exotic
> things like trees. But I have had many years of  experience in many
> languages and in each I develop a toolbag to work with. For longer lists
> where I wanted to be able to insert or delete at random, I might use
> something like linked lists that allow efficient changes without copying the
> entire list repeatedly. Of course, that works better  in languages that
> support something like pointers. For short lists like in this example, the
> overhead may be not worth it and a bit of copying may be as fast or faster
> and not worth worrying about.
> 
> I think I have said more than enough and won't continue this thread. There
> are many fairly decent ways of doing things  and this particular problem is
> not particularly interesting as I see little actual reason to build it. But
> having said that, I note lots of apps for my phone and elsewhere that are
> games for making words in many different ways, often by permuting a set of
> letters you can use. Underneath the hood, I suspect they have a dictionary
> they have to consult, and wonder how they implement their searches.
> 

You asked, "what would be wrong with this approach" and "am I missing
something about other approaches being better or more efficient". Those
are the questions I answered. I explained why I thought your approach
would likely be comparatively slow for realistic use cases. The OP
says he is coming back to Python after a layoff (having used Python 2
before), so I didn't see this as a teaching exercise. Rather, I showed
how I would do it in Python 3.

Duncan

