Fast full-text searching in Python (job for Whoosh?)

avi.e.gross at gmail.com
Mon Mar 6 13:45:34 EST 2023


Ah, thanks Dino. Autocomplete within a web page can be an interesting
scenario but also a daunting one.

Now, do you mean you have a web page with a text field, initially I suppose
empty, and the user types a single character and rapidly a drop-down list or
something is created and shown? And as they type, it may shrink? And as soon
as they select one, it is replaced in the text field and done?

If your form has an attached function written in JavaScript, some might load
your data into the browser and do all that work from within. No Python
needed.

Now if your scenario is similar to the above, or perhaps the user needs to
ask for autocompletion by using tab or something, and you want to keep
sending requests to a server, you can of course use any language on the
server. BUT I would be cautious in such a design.

My guess is you autocomplete on every keystroke and the user may well type
multiple characters, resulting in multiple requests for your program. Is a
new one called every time or is it a running service? If the latter, it pays
to read in the data once and then carefully serve it. But when you get just
the letter "h" you may not want to send and process a thousand results but
limit it to, say, the first N. If they then add an "o" to make "ho", you may
not need to do much if it is anchored to the start except to search in the
results of the previous search rather than the whole data.
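
As a rough illustration of the start-anchored case, here is a minimal sketch
in Python. It assumes a resident service that loads the data once, keeps it
sorted, and caps each reply at N entries. The names build_index, complete and
narrow are made up for this example, and most of the sample rows are invented.

```python
from bisect import bisect_left

# Made-up sample rows in "make,model" form; a real service would read its
# file once at startup instead of hard-coding a list.
SAMPLE = ["Honda,Civic", "Volvo,V60", "Hyundai,Ioniq", "Genesis,GV60", "Honda,Accord"]

def build_index(entries):
    """Sort (case-folded, original) pairs once so prefix lookups can use binary search."""
    return sorted((e.casefold(), e) for e in entries)

def complete(index, prefix, limit=10):
    """Return up to `limit` original entries that start with `prefix`, ignoring case."""
    key = prefix.casefold()
    # (key,) sorts before every (folded, original) pair whose folded text starts with key.
    pos = bisect_left(index, (key,))
    matches = []
    while pos < len(index) and len(matches) < limit and index[pos][0].startswith(key):
        matches.append(index[pos][1])
        pos += 1
    return matches

def narrow(previous, prefix):
    """Filter an earlier result list for a longer prefix.

    Only safe when `previous` was not cut off by `limit`; otherwise query the index again.
    """
    key = prefix.casefold()
    return [e for e in previous if e.casefold().startswith(key)]

index = build_index(SAMPLE)
print(complete(index, "h", limit=2))
print(narrow(complete(index, "h"), "ho"))
```

Note the caveat in narrow: if the earlier reply was truncated to the first N,
filtering it can miss entries, so going back to the sorted index is the safer
default.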

But have you done some searching on how autocomplete from a fixed corpus is
normally done? It is a quite common thing.


-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com at python.org> On
Behalf Of Dino
Sent: Monday, March 6, 2023 7:40 AM
To: python-list at python.org
Subject: Re: RE: Fast full-text searching in Python (job for Whoosh?)

Thank you for taking the time to write such a detailed answer, Avi. And 
apologies for not providing more info from the get go.

What I am trying to achieve here is supporting autocomplete (no pun 
intended) in a web form field, hence the -i case insensitive example in 
my initial question.

Your points are all good, and my original question was a bit rushed. I 
guess that the problem was that I saw this video:

https://www.youtube.com/watch?v=gRvZbYtwTeo&ab_channel=NextDayVideo

The idea that someone types into an input field and matches start 
dancing in the browser made me think that this was exactly what I 
needed, and hence I figured that asking here about Whoosh would be a 
good idea. I now realize that Whoosh would be overkill for my use case,
as a simple (case-insensitive) query substring would get me 90% of what
I want. Speed is on the order of a few milliseconds out of the box,
which is chump change in the context of a web UI.
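
For illustration only, a case-insensitive substring query of this kind might
look like the sketch below. The function name substring_matches and the
optional limit parameter are assumptions made for the example, not code from
this thread.

```python
def substring_matches(entries, query, limit=None):
    """Return entries containing `query` anywhere, ignoring case (like grep -i)."""
    needle = query.casefold()
    hits = [entry for entry in entries if needle in entry.casefold()]
    return hits if limit is None else hits[:limit]

# Two rows from the grep example in this thread plus one invented row.
cars = ["Genesis,GV60", "Volvo,V60", "Volvo,V90"]
print(substring_matches(cars, "v60"))
```

For a long-running service, the case-folded copies could be computed once at
load time rather than on every request.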

Thank you again for taking the time to look at my question.

Dino

On 3/5/2023 10:56 PM, avi.e.gross at gmail.com wrote:
> Dino, sending lots of data to an archived forum is not a great idea. I
> snipped most of it out below so as not to replicate it.
> 
> Your question does not look difficult unless your real question is about
> speed. Realistically, much of the time spent generally is in reading in a
> file and the actual search can be quite rapid with a wide range of
> methods.
> 
> The data looks boring enough and seems to not have much structure other
> than one comma possibly separating two fields. Do you want the data as one
> wide field or perhaps in two parts, which a CSV file is normally used to
> represent? Do you ever have questions like tell me all cars whose name
> begins with the letter D and has a V6 engine? If so, you may want more
> than a vanilla search.
> 
> What exactly do you want to search for? Is it a set of built-in searches
> or something the user types in?
> 
> The data seems to be sorted by the first field and then by the second and
> I did not check if some searches might be ambiguous. Can there be many
> entries containing III? Yep. Can the same words like Cruiser or Hybrid
> appear?
> 
> So is this a one-time search or multiple searches once loaded as in a
> service that stays resident and fields requests? The latter may be worth
> speeding up.
> 
> I don't NEED to know any of this but want you to know that the answer may
> depend on this and similar factors. We had a long discussion lately on
> whether to search using regular expressions or string methods. If your
> data is meant to be used once, you may not even need to read the file into
> memory, but read something like a line at a time and test it. Or, if you
> end up with more data like how many cylinders a car has, it may be time to
> read it in not just to a list of lines or such data structures, but get
> numpy/pandas involved and use their many search methods in something like
> a data.frame.
> 
> Of course if you are worried about portability, keep using Get Regular
> Expression Print.
> 
> Your example was:
> 
>       $ grep -i v60 all_cars_unique.csv
>       Genesis,GV60
>       Volvo,V60
> 
> You seem to have wanted case folding and that is NOT a normal search. And
> your search is matching anything on any line. If you wanted only a
> complete field, such as all text after a comma to the end of the line, you
> could use grep specifications to say that.
> 
> But once inside python, you would need to make choices depending on what
> kind of searches you want to allow but also things like do you want all
> matching lines shown if you search for say "a" ...
> 
> 
-- 
https://mail.python.org/mailman/listinfo/python-list



