Fast full-text searching in Python (job for Whoosh?)

avi.e.gross at gmail.com avi.e.gross at gmail.com
Mon Mar 6 12:49:20 EST 2023


Thomas,

I may have missed any discussion where the OP explained more about the proposed usage. If the program is designed to load the full data once, never get updates except by re-reading some file, and then handle multiple requests, then some things may be worth doing.

It looked to me, and I may well be wrong, like he wanted to search for a string anywhere in the text, so a grep-like solution is a reasonable start, with the actual data stored as something like a list of character strings you can search "one line" at a time. I suspect a numpy variant may work faster, though I have not measured it.
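
To make the grep-like idea concrete, here is a minimal sketch, assuming the data is already in memory as a list of strings and that a case-insensitive match is wanted. The function name search_lines and the sample records are made up for illustration and are not from the OP's data.

```python
def search_lines(lines, needle, ignore_case=True):
    """Return every line that contains needle anywhere in it (a linear scan)."""
    if ignore_case:
        needle = needle.casefold()
        return [line for line in lines if needle in line.casefold()]
    return [line for line in lines if needle in line]


# Hypothetical sample data, standing in for whatever is loaded once at startup.
cars = [
    "Honda Accord V6 2012",
    "Toyota Camry 2015",
    "Ford Mustang V6 2011",
]

matches = search_lines(cars, "v6")
for line in matches:
    print(line)
```

Every query walks the whole list, so the cost grows with the amount of data, but nothing has to be built up front and a match can sit anywhere in the line.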

And of course any search function he builds can be made to remember some or all previous searches using a cache decorator. That generally uses a dictionary for the search keys internally.
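
As a rough illustration of the caching idea, the sketch below wraps a search function in functools.lru_cache from the standard library. The names CARS and cached_search are invented here, and the sketch assumes the data does not change between reloads, since a cached answer is only valid for the data it was computed from.

```python
from functools import lru_cache

# Hypothetical data, loaded once. A tuple is used so it cannot be changed
# behind the cache's back.
CARS = (
    "Honda Accord V6 2012",
    "Toyota Camry 2015",
    "Ford Mustang V6 2011",
)


@lru_cache(maxsize=1024)
def cached_search(needle):
    """Case-insensitive substring search; repeated queries come from the cache."""
    needle = needle.casefold()
    # Return a tuple so callers cannot mutate the cached result.
    return tuple(line for line in CARS if needle in line.casefold())


first = cached_search("v6")    # computed by scanning CARS
second = cached_search("v6")   # answered from the cache
info = cached_search.cache_info()
print(info.hits, info.misses)
```

If the file is re-read, calling cached_search.cache_clear() discards the remembered answers so stale results are not returned.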

But using lots of dictionaries strikes me as helping only if you are searching for text anchored to the start of a line. If you ask for "Honda", you would instead consult the dictionary called "h", search perhaps just for "onda", and then recombine the prefix in any results. But the example given wanted to match something like "V6" in the middle of the text, and I do not see how that would work, as you would now need to search all 26 dictionaries completely.
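
A small sketch of what I mean, under my own reading of the proposal: one bucket per first letter, which is not necessarily what anyone else in the thread had in mind. All the names here (build_buckets, search_anchored, search_anywhere) are hypothetical. The anchored search touches a single bucket, while the "anywhere" search still has to visit every bucket and so saves nothing over a plain scan.

```python
from collections import defaultdict


def build_buckets(lines):
    """Group lines by their (case-folded) first character."""
    buckets = defaultdict(list)
    for line in lines:
        if line:
            buckets[line.casefold()[:1]].append(line)
    return buckets


def search_anchored(buckets, prefix):
    """Find lines that START with prefix: only one bucket is examined."""
    prefix = prefix.casefold()
    if not prefix:
        return []
    candidates = buckets.get(prefix[:1], [])
    return [line for line in candidates if line.casefold().startswith(prefix)]


def search_anywhere(buckets, needle):
    """Find lines containing needle anywhere: every bucket must be scanned."""
    needle = needle.casefold()
    return [
        line
        for bucket in buckets.values()
        for line in bucket
        if needle in line.casefold()
    ]


# Hypothetical sample data.
cars = ["Honda Accord V6 2012", "Toyota Camry 2015", "Ford Mustang V6 2011"]
buckets = build_buckets(cars)

print(search_anchored(buckets, "Honda"))
print(search_anywhere(buckets, "V6"))
```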



-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com at python.org> On Behalf Of Thomas Passin
Sent: Monday, March 6, 2023 11:03 AM
To: python-list at python.org
Subject: Re: Fast full-text searching in Python (job for Whoosh?)

On 3/6/2023 10:32 AM, Weatherby,Gerard wrote:
> Not sure if this is what Thomas meant, but I was also thinking dictionaries.
> 
> Dino could build a set of dictionaries with keys “a” through “z” that contain data with those letters in them (I’m assuming case-insensitive search) and then just search “v” if that’s what the user starts with.
> 
> Increased performance may be achieved by building dictionaries “aa”, “ab” ... “zz”, and so on.
> 
> Of course, it’s trading CPU for memory usage, and there’s likely a point at which the cost of building dictionaries exceeds the savings in searching.

Chances are it would only be seconds at most to build the data cache, 
and then subsequent queries would respond very quickly.

> 
> From: Python-list <python-list-bounces+gweatherby=uchc.edu at python.org> on behalf of Thomas Passin <list1 at tompassin.net>
> Date: Sunday, March 5, 2023 at 9:07 PM
> To: python-list at python.org <python-list at python.org>
> Subject: Re: Fast full-text searching in Python (job for Whoosh?)
> 
> I would probably ingest the data at startup into a dictionary - or
> perhaps several depending on your access patterns - and then you will
> only need to do a fast lookup in one or more dictionaries.
> 
> If your access pattern would be easier with SQL queries, load the data
> into an SQLite database on startup.
> 
> IOW, do the bulk of the work once at startup.
> --
> https://mail.python.org/mailman/listinfo/python-list

-- 
https://mail.python.org/mailman/listinfo/python-list


