Indexing HTML!

Martin Christensen knightsofspamalot-factotum at gvdnet.dk
Sat Dec 28 17:15:47 EST 2002


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

>>>>> "John" == John  <johng2001 at rediffmail.com> writes:
John> I have been struggling for the past few days to get this done. I
John> have a few small document (HTML) collections, each of which will
John> be exposed on an independent simplistic intranet site (Apache on
John> Linux). I need some indexing solutions.

As I wrote yesterday in a different thread, I'm working on a
keyword-based search engine that runs on top of relational databases.
For that I have some code that you might be able to build on. I've
built the foundation for an index such as the one you're looking for,
and with a bit of hacking you should be able to make use of it, I'd
guess.

The full-text indexer is designed to accept different 'text munchers',
each of which applies a filter to a text to prepare it for indexing.
One such filter could strip out the HTML markup. That should be fairly
easy to write using regular expressions, I should think.
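
For illustration, here is a minimal sketch of what such an HTML-stripping
muncher might look like in Python. It is not the indexer's actual code:
the names strip_html and munch are hypothetical, and regular expressions
only cope with reasonably tidy HTML, so treat it as a starting point
rather than a robust parser.

```python
import html
import re

# Hypothetical 'text muncher' (illustrative names, not the real indexer's API).
# Drop <script>/<style> elements together with their contents.
_SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>",
                           re.IGNORECASE | re.DOTALL)
# Drop HTML comments.
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Drop any remaining tags.
_TAG = re.compile(r"<[^>]+>")
# A term is a run of ASCII letters and digits (an assumption of this sketch).
_WORD = re.compile(r"[a-z0-9]+")


def strip_html(text):
    """Return text with markup replaced by spaces and entities decoded."""
    text = _SCRIPT_STYLE.sub(" ", text)
    text = _COMMENT.sub(" ", text)
    text = _TAG.sub(" ", text)
    return html.unescape(text)


def munch(text):
    """Filter an HTML document down to a list of lowercase index terms."""
    return _WORD.findall(strip_html(text).lower())
```

Calling munch() on the contents of each HTML file would yield the list
of terms to hand to the index.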

Taken as a whole it probably won't be the best fit for your use,
because of its focus on relational databases, but I'll gladly share
what code I have (I'm releasing it under the GPL once I've cleaned it
up), and in a couple of weeks I can give you a technical report to
help you better grok the code.

While you're at it, a paper you might want to look at is 'Inverted
Files Versus Signature Files for Text Indexing' by Zobel, Moffat and
Ramamohanarao
(http://www.cs.arizona.edu/people/tods/accepted/1998/ZobelInverted.ps).

In its current state, my index handles neither updates to existing
data nor removal of data. Both are absolutely necessary unless you
want to recreate the index every time something changes (which might
be realistic in your case, as you describe it).
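
To make the rebuild-every-time approach concrete, here is a minimal
sketch of an in-memory inverted file in Python. It is an assumption-laden
illustration, not the index described above: build_index and search are
hypothetical names, documents are assumed to arrive as already-filtered
term lists, and nothing is persisted to disk or to a database.

```python
from collections import defaultdict


def build_index(documents):
    """Build an inverted file from scratch.

    documents maps a document id (e.g. a file name) to its list of terms.
    The result maps each term to the set of ids of documents containing it.
    """
    index = defaultdict(set)
    for doc_id, terms in documents.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def search(index, query_terms):
    """Return the ids of documents that contain every query term."""
    postings = [index.get(term, set()) for term in query_terms]
    if not postings:
        return set()
    return set.intersection(*postings)
```

With no support for updates or removals, a changed collection is handled
by simply calling build_index() again on the new set of documents, which
is cheap enough for a few small collections.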

Let me know if it sounds interesting to you. I'm sure many people
would benefit from your work if you develop it in the direction that
you yourself need.

Martin

- -- 
Homepage:       http://www.cs.auc.dk/~factotum/
GPG public key: http://www.cs.auc.dk/~factotum/gpgkey.txt
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: Using Mailcrypt+GnuPG <http://www.gnupg.org>

iEYEARECAAYFAj4OIpMACgkQYu1fMmOQldU6PgCgkmIhUV288cKuqc0bWshda4NL
PccAoMLMh8PSKawMPsBOqR/xEOElejji
=LJJW
-----END PGP SIGNATURE-----


