[Tutor] A problem: scanning for keywords in a bunch of text?

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Thu, 2 May 2002 00:13:23 -0700 (PDT)


On Mon, 29 Apr 2002, Daniel Coughlin wrote:

> This is just a thought, and I might be entirely off base, but can you
> process the the strings prior to searching through them? If so then you
> should be able to create some kind of numerical index on each chunk of
> searchable data and that could make your scanning go faster. I have
> thought about this for about 20 minutes so it might not make too much
> sense or be very clear.

No, this makes perfect sense.  I think that's a great idea.  Python has a
function called intern() that I can use to do this:

###
>>> print intern.__doc__
intern(string) -> string

``Intern'' the given string.  This enters the string in the (global)
table of interned strings whose purpose is to speed up dictionary lookups.
Return the string itself or the previously interned string object with the
same value.
###


I didn't think about using it before.  It must have been some unconcious
stigma from previous experience with Java, I think.  Bug 4035345, to be
exact.

    http://java.sun.com/products/jdk/1.1/knownbugs/runtime.html

*grin* But I'll try this out and see if it speeds things up for me.
It'll definitely reduce the amount of data I'm throwing around.



>   If you could process each searchable text (it looks like you might be
> seaching through abstracts?) into a list of numbers, you could then
> search that list of numbers. And it would be faster then searching the
> whole text.

Yes, that's what I'm doing.  I need to be careful about preserving the
ability to match sequences, because some of my keywords are not just
words, but really long phrases.

The one problem I'm thinking about is, later, I'd like to try extending
this so it's not such an exact match.  But I'm starting to think I should
just use Udi Manber's 'agrep' approximate string-matching utility, now
that I've read a bit more about it.

    http://citeseer.nj.nec.com/muth96approximate.html



> It also looks a little less complicated than the suffix trees ;-)

Yeah, I don't know what I was thinking.  I'm surprised but the Suffix
Trees actually work pretty well... but it takes forever to free() the darn
things afterwards.


Thanks again for the help!