Fast full-text searching in Python (job for Whoosh?)

avi.e.gross at gmail.com avi.e.gross at gmail.com
Sun Mar 5 22:56:26 EST 2023


Dino, Sending lots of data to an archived forum is not a great idea. I
snipped most of it out below as not to replicate it.

Your question does not look difficult unless your real question is about
speed. Realistically, much of the time spent generally is in reading in a
file and the actual search can be quite rapid with a wide range of methods.

The data looks boring enough and seems to not have much structure other than
one comma possibly separating two fields. Do you want the data as one wide
filed or perhaps in two parts, which a CSV file is normally used to
represent. Do you ever have questions like tell me all cars whose name
begins with the letter D and has a V6 engine? If so, you may want more than
a vanilla search.

What exactly do you want to search for? Is it a set of built-in searches or
something the user types in?

The data seems to be sorted by the first field and then by the second and I
did not check if some searches might be ambiguous. Can there be many entries
containing III? Yep. Can the same words like Cruiser or Hybrid appear? 

So is this a one-time search or multiple searches once loaded as in a
service that stays resident and fields requests. The latter may be worth
speeding up.

I don't NEED to know any of this but want you to know that the answer may
depend on this and similar factors. We had a long discussion lately on
whether to search using regular expressions or string methods. If your data
is meant to be used once, you may not even need to read the file into
memory, but read something like a line at a time and test it. Or, if you end
up with more data like how many cylinders a car has, it may be time to read
it in not just to a list of lines or such data structures, but get
numpy/pandas involved and use their many search methods in something like a
data.frame.

Of course if you are worried about portability, keep using Get Regular
Expression Print.

Your example was:

     $ grep -i v60 all_cars_unique.csv
     Genesis,GV60
     Volvo,V60

You seem to have wanted case folding and that is NOT a normal search. And
your search is matching anything on any line. If you wanted only a complete
field, such as all text after a comma to the end of the line, you could use
grep specifications to say that.

But once inside python, you would need to make choices depending on what
kind of searches you want to allow but also things like do you want all
matching lines shown if you search for say "a" ...




-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com at python.org> On
Behalf Of Dino
Sent: Saturday, March 4, 2023 10:47 PM
To: python-list at python.org
Subject: Re: Fast full-text searching in Python (job for Whoosh?)


Here's the complete data file should anyone care.

Acura,CL
Acura,ILX
Acura,Integra
Acura,Legend
<SNIP>
smart,fortwo electric drive
smart,fortwo electric drive cabrio

-- 
https://mail.python.org/mailman/listinfo/python-list



More information about the Python-list mailing list