Fast full-text searching in Python (job for Whoosh?)

avi.e.gross at gmail.com avi.e.gross at gmail.com
Mon Mar 6 12:37:09 EST 2023


Gerard,

I was politely pointing out how it was more than the minimum necessary and
might get repeated multiple times as people replied. The storage space is a
resource someone else provides and I prefer not abusing it.

However, since the OP seems to be asking a question focused on how long it
takes to search using possible techniques, indeed some people would want the
entire data to test with.

In my personal view, a snippet of the data is what I need to see how it
is organized, and what I need far more is some idea of what kind of
searching is needed.

If I was told there would be a web page allowing users to search a web
service hosting the data on a server with one process called as much as
needed that spawned threads to handle the task, I might see it as very
worthwhile to read in the data once into some data structure that allows
rapid searches over and over.  If it is an app called ONCE as a whole for
each result, as in the grep example, why bother? Just read a line at a
time and be done with it.
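
To make the two cases concrete, here is a minimal sketch, not anything the
OP posted. The function names and the tiny sample text are made up for
illustration, and the word-to-line-number index is just one assumed choice of
"some data structure"; note it matches whole words only, while the line scan
matches substrings the way grep would.

```python
import io
from collections import defaultdict


def scan_lines(lines, needle):
    """One-shot case: stream the lines once and keep those containing needle."""
    needle = needle.lower()
    return [line.rstrip("\n") for line in lines if needle in line.lower()]


def build_index(lines):
    """Long-running case: read everything once and index it by word.

    Returns the stored lines plus a dict mapping each lowercase word to the
    set of line numbers on which it appears.
    """
    records = []
    index = defaultdict(set)
    for number, line in enumerate(lines):
        text = line.rstrip("\n")
        records.append(text)
        for word in text.lower().split():
            index[word].add(number)
    return records, index


def lookup(records, index, word):
    """Answer a whole-word query from the prebuilt index without rescanning."""
    return [records[n] for n in sorted(index.get(word.lower(), ()))]


# Hypothetical sample data standing in for the real file.
SAMPLE = "alpha beta\ngamma delta\nbeta epsilon\n"

if __name__ == "__main__":
    print(scan_lines(io.StringIO(SAMPLE), "beta"))
    records, index = build_index(io.StringIO(SAMPLE))
    print(lookup(records, index, "beta"))
```

The index costs memory and startup time, so it only pays off when the same
process answers many queries.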

My suggestion remains my preference. The discussion is archived. Messages
can optimally be trimmed as needed and not allowed to contain the full
contents of the last twenty replies back and forth unless that is needed.
Larger amounts of data can be offered to share and, if wanted, can be posted
or sent to someone asking for it or placed in some publicly accessible place.

But my preference may not be relevant as the forum has hosts or owners and
it is what they want that counts.

The data this time was not really gigantic. But I often work with data from
a CSV that has hundreds of columns and hundreds of thousands or more rows,
with some of the columns containing large amounts of text. But I may be
interested in how to work with say just half a dozen columns and for the
purposes of my question here, perhaps a hundred representative rows. Should
I share everything, or maybe save the subset and only share that?
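
Saving such a subset takes only a few lines with the standard csv module.
The following is a rough sketch under stated assumptions: the function name,
the column names and the sample rows are invented for illustration, and it
simply keeps the first rows rather than choosing "representative" ones, which
would need sampling or hand-picking.

```python
import csv
import io
from itertools import islice


def subset_csv(source, dest, columns, max_rows):
    """Copy only the named columns of the first max_rows data rows.

    source and dest are open text file objects; returns the rows written.
    """
    reader = csv.DictReader(source)
    # extrasaction="ignore" silently drops every column not listed.
    writer = csv.DictWriter(dest, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    count = 0
    for row in islice(reader, max_rows):
        writer.writerow(row)
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical stand-in for a wide CSV with a large text column.
    source = io.StringIO("id,name,notes\n1,a,long text\n2,b,more text\n3,c,etc\n")
    dest = io.StringIO()
    subset_csv(source, dest, ["id", "name"], 2)
    print(dest.getvalue())
```

With real files, open both with newline="" as the csv documentation
recommends.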

This is not about Python as a language but about expressing ideas and
opinions on a public forum with limited resources. Yes, over the years, my
combined posts probably use far more archival space. We are not asked to be
sparse, just not be wasteful. 

The OP may consider what he is working with as a LOT of data but it really
isn't by modern standards. 

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com at python.org> On
Behalf Of Weatherby,Gerard
Sent: Monday, March 6, 2023 10:35 AM
To: python-list at python.org
Subject: Re: Fast full-text searching in Python (job for Whoosh?)

"Dino, Sending lots of data to an archived forum is not a great idea. I
snipped most of it out below as not to replicate it."

Surely in 2023, storage is affordable enough there's no need to criticize
Dino for posting complete information. If mailing space is a consideration,
we could all help by keeping our replies short and to the point.

-- 
https://mail.python.org/mailman/listinfo/python-list