"Is-it-a-word-module"

Warren Postma embed at geocities.com
Mon May 15 11:04:41 EDT 2000


> In an indexing module I now index a lot of "words" that are actually
> just meaningless streams of characters, such as "translated" URLs and
> other stuff generated by Internet robots of some kind. At first I
> thought I could just look up each "word" in a dictionary, but then I
> realized that a lot of things, like names such as my own, are not in the
> dictionary but should be indexed. I therefore want some way of
> guessing whether a bunch of characters could actually be a meaningful word.

If all you want to filter out is URLs and CGI jumble, why not count as a
word anything that contains only letters plus at most one piece of
punctuation (can't, won't)? If your data is all in English, that'll do
nicely. If not, and you want to be thoroughly international about it, you've
got a nasty problem on your hands. Your email address showed up as garble on
my screen, for example, because it isn't using the same code page as mine.
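
Here is a rough sketch of that filter, just to make the rule concrete. The
function name looks_like_word and the choice of apostrophe and hyphen as the
allowed punctuation are my own assumptions, not anything standard, and the
requirement that the punctuation sit inside the token is an extra guess at
what "can't, won't" implies:

```python
def looks_like_word(token, allowed_punct="'-"):
    """Return True if token is all letters, optionally with one internal
    punctuation mark from allowed_punct (e.g. "can't", "won't").

    Hypothetical helper: the name and the allowed_punct default are
    illustrative assumptions.
    """
    punct_count = 0
    for i, ch in enumerate(token):
        if ch.isalpha():
            continue
        # Allow a punctuation mark only in the interior of the token.
        if ch in allowed_punct and 0 < i < len(token) - 1:
            punct_count += 1
            if punct_count > 1:
                return False
            continue
        # Digits, '%', '?', '=', '.', and so on mark URL/CGI jumble.
        return False
    # Reject the empty string.
    return len(token) > punct_count


samples = ["can't", "Postma", "index.cgi?id=42", "%7Euser", "won't", "a--b"]
kept = [s for s in samples if looks_like_word(s)]
print(kept)
```

Note that in Python 3 str.isalpha() accepts letters from any script, so this
does not restrict you to English, but it does nothing about text that was
decoded with the wrong code page in the first place.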

Warren




