Fast full-text searching in Python (job for Whoosh?)

Thomas Passin list1 at tompassin.net
Mon Mar 6 13:38:15 EST 2023


On 3/6/2023 12:49 PM, avi.e.gross at gmail.com wrote:
> Thomas,
> 
> I may have missed any discussion where the OP explained more about proposed usage. If the program is designed to load the full data once, never get updates except by re-reading some file, and then handles multiple requests, then some things may be worth doing.
> 
> It looked to me, and I may well be wrong, like he wanted to search for a string anywhere in the text so a grep-like solution is a reasonable start with the actual data being stored as something like a list of character strings you can search "one line" at a time. I suspect a numpy variant may work faster.
> 
> And of course any search function he builds can be made to remember some or all previous searches using a cache decorator. That generally uses a dictionary for the search keys internally.
> 
> But using lots of dictionaries strikes me as only helping if you are searching for text anchored to the start of a line, so if you ask for "Honda" you instead ask the dictionary called "h" and search perhaps just for "onda", then recombine the prefix in any results. But the example given wanted to match something like "V6" in the middle of the text, and I do not see how that would work, as you would now need to search 26 dictionaries completely.

Well, that's the question, isn't it?  Just how is this expected to be 
used?  I didn't read the initial posting that carefully, and I may have 
missed something that makes a difference.

The OP gives as an example a user entering a string ("v60").  The 
example is for a model designation.  If we know that this entry box will 
only receive model designations, then I would populate a dictionary 
using the model numbers as keys.  The number of distinct keys will 
probably not be that large.

For example, highly simplified of course:

 >>> models = {'v60': 'Volvo', 'gv60': 'Genesis', 'cl': 'Acura'}
 >>> entry = '60'
 >>> candidates = (m for m in models if entry in m)
 >>> list(candidates)
['v60', 'gv60']

The keys would be lower-cased.  A separate dictionary would give the 
complete string with the desired casing.  The values could be object 
references to the complete information.  If there might be several 
different models with the same key, then the values could be lists or 
dictionaries and one would need to do some disambiguation, but that 
should be simple and quick.
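
To make that a bit more concrete, here is a minimal sketch.  The names 
(build_index, find_models) and the sample records are made up for 
illustration; the OP's real data will look different.  The original 
casing is carried in the stored record here rather than in a separate 
dictionary, which amounts to the same thing.

```python
from collections import defaultdict

# Hypothetical sample records: (model designation, make).
RECORDS = [
    ("V60", "Volvo"),
    ("GV60", "Genesis"),
    ("CL", "Acura"),
    ("CL", "Mercedes-Benz"),  # same designation, different make
]


def build_index(records):
    """Map each lower-cased model key to a list of the full records."""
    index = defaultdict(list)
    for model, make in records:
        index[model.lower()].append((model, make))
    return index


def find_models(index, entry):
    """Return every record whose lower-cased key contains the entry."""
    needle = entry.lower()
    return [rec for key, recs in index.items() if needle in key for rec in recs]


index = build_index(RECORDS)
print(find_models(index, "60"))  # records for V60 and GV60
print(find_models(index, "cl"))  # two candidates left to disambiguate
```

This still scans every key for each query, which should be fine as long 
as the number of distinct keys stays small.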

It all depends on the planned access patterns.  If the OP really wants 
full-text search in the complete unstructured data file, then yes, a 
full text indexer of some kind will be useful.  Whoosh certainly looks 
good though I have not used it.  But for populating dropdown lists in 
web forms, most likely the design of the form will provide a structure 
for the various searches.

> -----Original Message-----
> From: Python-list <python-list-bounces+avi.e.gross=gmail.com at python.org> On Behalf Of Thomas Passin
> Sent: Monday, March 6, 2023 11:03 AM
> To: python-list at python.org
> Subject: Re: Fast full-text searching in Python (job for Whoosh?)
> 
> On 3/6/2023 10:32 AM, Weatherby,Gerard wrote:
>> Not sure if this is what Thomas meant, but I was also thinking dictionaries.
>>
>> Dino could build a set of dictionaries with keys “a” through “z” that contain data with those letters in them (I’m assuming case-insensitive search), and then just search “v” if that’s what the user starts with.
>>
>> Increased performance may be achieved by building dictionaries “aa”, “ab” ... “zz”, and so on.
>>
>> Of course, it’s trading CPU for memory usage, and there’s likely a point at which the cost of building dictionaries exceeds the savings in searching.
> 
> Chances are it would only be seconds at most to build the data cache,
> and then subsequent queries would respond very quickly.
> 
>>
>> From: Python-list <python-list-bounces+gweatherby=uchc.edu at python.org> on behalf of Thomas Passin <list1 at tompassin.net>
>> Date: Sunday, March 5, 2023 at 9:07 PM
>> To: python-list at python.org <python-list at python.org>
>> Subject: Re: Fast full-text searching in Python (job for Whoosh?)
>>
>> I would probably ingest the data at startup into a dictionary - or
>> perhaps several depending on your access patterns - and then you will
>> only need to do a fast lookup in one or more dictionaries.
>>
>> If your access pattern would be easier with SQL queries, load the data
>> into an SQLite database on startup.
>>
>> IOW, do the bulk of the work once at startup.
>> --
>> https://mail.python.org/mailman/listinfo/python-list
> 


