Fast full-text searching in Python (job for Whoosh?)

Dino dino at no.spam.ar
Mon Mar 6 07:40:29 EST 2023


Thank you for taking the time to write such a detailed answer, Avi. And 
apologies for not providing more info from the get go.

What I am trying to achieve here is supporting autocomplete (no pun 
intended) in a web form field, hence the case-insensitive -i example in 
my initial question.

Your points are all good, and my original question was a bit rushed. I 
guess that the problem was that I saw this video:

https://www.youtube.com/watch?v=gRvZbYtwTeo&ab_channel=NextDayVideo

The idea that someone types into an input field and matches start 
dancing in the browser made me think that this was exactly what I 
needed, and hence I figured that asking here about Whoosh would be a 
good idea. I now realize that Whoosh would be overkill for my use case, 
as a simple case-insensitive substring query would get me 90% of what 
I want. Speed is on the order of a few milliseconds out of the box, 
which is chump change in the context of a web UI.
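
For the record, a minimal sketch of the kind of case-insensitive 
substring lookup I mean might look like the one below. The names 
(load_entries, suggest), the limit parameter, and the sample rows are 
made up for illustration; the "Make,Model" line layout is assumed from 
the grep example in my initial question.

```python
def load_entries(lines):
    """Return (folded, original) pairs so case folding happens only once."""
    entries = []
    for line in lines:
        text = line.strip()
        if text:  # skip blank lines
            entries.append((text.casefold(), text))
    return entries


def suggest(entries, query, limit=10):
    """Return up to `limit` original lines containing `query`, ignoring case."""
    needle = query.strip().casefold()
    if not needle:
        return []  # an empty input field should not match everything
    matches = []
    for folded, original in entries:
        if needle in folded:
            matches.append(original)
            if len(matches) == limit:
                break
    return matches


# Hypothetical sample rows; a real service would read the CSV file once at startup.
cars = load_entries([
    "Genesis,GV60",
    "Volvo,V60",
    "Volvo,XC90",
    "Toyota,Land Cruiser",
])

print(suggest(cars, "v60"))  # ['Genesis,GV60', 'Volvo,V60']
```

Folding the entries once at load time means each keystroke only costs a 
linear scan over strings already in memory, and capping the result count 
keeps very short queries such as "a" from flooding the dropdown.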

Thank you again for taking the time to look at my question.

Dino

On 3/5/2023 10:56 PM, avi.e.gross at gmail.com wrote:
> Dino, Sending lots of data to an archived forum is not a great idea. I
> snipped most of it out below so as not to replicate it.
> 
> Your question does not look difficult unless your real question is about
> speed. Realistically, much of the time spent generally is in reading in a
> file and the actual search can be quite rapid with a wide range of methods.
> 
> The data looks boring enough and seems to not have much structure other than
> one comma possibly separating two fields. Do you want the data as one wide
> field or perhaps in two parts, which a CSV file is normally used to
> represent? Do you ever have questions like tell me all cars whose name
> begins with the letter D and has a V6 engine? If so, you may want more than
> a vanilla search.
> 
> What exactly do you want to search for? Is it a set of built-in searches or
> something the user types in?
> 
> The data seems to be sorted by the first field and then by the second and I
> did not check if some searches might be ambiguous. Can there be many entries
> containing III? Yep. Can the same words like Cruiser or Hybrid appear?
> 
> So is this a one-time search or multiple searches once loaded as in a
> service that stays resident and fields requests? The latter may be worth
> speeding up.
> 
> I don't NEED to know any of this but want you to know that the answer may
> depend on this and similar factors. We had a long discussion lately on
> whether to search using regular expressions or string methods. If your data
> is meant to be used once, you may not even need to read the file into
> memory, but read something like a line at a time and test it. Or, if you end
> up with more data like how many cylinders a car has, it may be time to read
> it in not just to a list of lines or such data structures, but get
> numpy/pandas involved and use their many search methods in something like a
> data.frame.
> 
> Of course if you are worried about portability, keep using Get Regular
> Expression Print.
> 
> Your example was:
> 
>       $ grep -i v60 all_cars_unique.csv
>       Genesis,GV60
>       Volvo,V60
> 
> You seem to have wanted case folding and that is NOT a normal search. And
> your search is matching anything on any line. If you wanted only a complete
> field, such as all text after a comma to the end of the line, you could use
> grep specifications to say that.
> 
> But once inside python, you would need to make choices depending on what
> kind of searches you want to allow but also things like do you want all
> matching lines shown if you search for say "a" ...
> 
> 

