Top 10 Language Constructs (Python)

Aahz Maruch aahz at netcom.com
Sat Jul 15 10:37:57 EDT 2000


In article <8kpavi04id at news1.newsguy.com>,
Alex Martelli <alex at magenta.com> wrote:
>"Aahz Maruch" <aahz at netcom.com> wrote in message
>news:8knng5$eue$1 at slb3.atl.mindspring.net...
>> In article <8kn71u02nn4 at news2.newsguy.com>,
>> Alex Martelli <alex at magenta.com> wrote:
>>>"Dinu C. Gherman" <gherman at darwin.in-berlin.de> wrote in message
>>>news:396EE7BC.DA2FA7B3 at darwin.in-berlin.de...
>>>>
>>>> I doubt this will be very useful, like the answer to the
>>>> question: "Which are the ten most used words in English?".
>>>
>>>That one is pretty useful, actually (it identifies the all-too-
>>>frequent words, like 'the', 'of', etc, which a full-text search
>>>engine had better just skip over:-).
>>
>> That's wrong.  How do you search for the phrase "the White House"?
>
>You get all adjacencies of 'white' and 'house', assuming your search
>tables are case-insensitive (which they should not be, for umpteen
>reasons, but that's another thread, I guess).  This only adds a modest
>number of false positives: since you were going to get references to
>"the white house at the corner of Elm and Main" anyway, it's no great
>hardship to also get any "a white house, quite pretty, with a red
>roof".

Here's a better example: "The Cat in the Hat".  With two adjacent noise
words ("in the"), your false positive rate goes up.
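
To make the effect concrete, here is a minimal Python sketch, not any
real engine's algorithm.  It assumes a toy stop-word list and two
hypothetical helpers, content_words() and phrase_matches(): noise words
are dropped on both sides, and a phrase query matches any document whose
remaining words contain the query's remaining words as an adjacent run.

```python
import re

# Assumed toy stop-word list; a real engine's list would be much longer.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "at"}

def content_words(text):
    """Lowercase, tokenize, and drop stop words (hypothetical helper)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def phrase_matches(query, document):
    """True if the query's content words occur as an adjacent run
    among the document's content words."""
    q = content_words(query)
    d = content_words(document)
    if not q:
        return False
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

query = "The Cat in the Hat"
for doc in ("I read The Cat in the Hat to the kids.",
            "We own a cat and a hat.",
            "The cat sat on the hat."):
    print(phrase_matches(query, doc), "-", doc)
```

Under these assumptions the query shrinks to "cat hat", so the second
document matches even though it never contains the title.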

>But meanwhile you're protecting yourself against losing many relevant
>positives due to quirks of English syntax and idiom.  "White
>House sources denied the allegations" and "The White House denied
>the allegations" are basically equivalent; "the", like all such
>syntax-marker words, tends to appear or disappear from phrases
>depending, unsurprisingly, on syntax (and idiom) issues.

Note that many people are not searching for a random document that
contains something similar to a phrase; they are searching for a
document that contains a specific phrase (e.g. "the tree of liberty").
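
For contrast, a sketch of exact-phrase matching over a positional index
that keeps every word, noise words included.  The index layout, the
sample documents, and the names build_index() and find_phrase() are
illustrative assumptions, not a description of how any particular
product works.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase and split into words; nothing is discarded."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_index(docs):
    """Map each word to {doc_id: [positions]}, stop words included."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word][doc_id].append(pos)
    return index

def find_phrase(index, phrase):
    """Return sorted ids of documents containing the exact word sequence."""
    words = tokenize(phrase)
    if not words:
        return []
    hits = []
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            # Every later word must sit at the expected offset.
            if all(start + k in index.get(w, {}).get(doc_id, [])
                   for k, w in enumerate(words)):
                hits.append(doc_id)
                break
    return sorted(hits)

# Made-up sample documents.
docs = {
    1: "He quoted the line about the tree of liberty.",
    2: "A liberty tree stood on the common.",
    3: "The tree is a symbol of liberty.",
}
index = build_index(docs)
print(find_phrase(index, "the tree of liberty"))
```

Only document 1 contains the words in that exact order; the price is an
index that stores a position for every "the" and "of".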

>And your tables shrink usefully.  You can invest the saved space to
>provide other precious services that are often skimped on, like, guess
>what!, proper casing-flags, flags for inflection, etc.  So you can
>search for: {white constrain initial-cap} adjacent {house constrain
>initial-cap, singular} and get all "properly capitalized" references to
>the White House, with or without article.

I should note that my primary experience is with Verity, so I'm
thoroughly familiar with all these operations; I consider a search
engine inferior if it doesn't support them.  All the same, most people
seem unwilling to learn complex query languages to get the necessary
precision in their results.
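
As a rough illustration of the constrained adjacency query quoted above
({white constrain initial-cap} adjacent {house constrain initial-cap}),
here is a small Python sketch.  It is not Verity's query language: the
stop-word list and the helpers index_tokens() and adjacent_capitalized()
are assumptions made up for the example, and the "singular" constraint
is left out.

```python
import re

# Assumed toy stop-word list.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "at"}

def index_tokens(text):
    """Return (lowercased word, initial_cap flag) pairs, stop words dropped."""
    tokens = []
    for word in re.findall(r"[A-Za-z0-9']+", text):
        if word.lower() in STOP_WORDS:
            continue
        tokens.append((word.lower(), word[0].isupper()))
    return tokens

def adjacent_capitalized(text, first, second):
    """True if `first` is immediately followed by `second` (after stop-word
    removal) and both carry the initial-cap flag."""
    tokens = index_tokens(text)
    return any(a == (first, True) and b == (second, True)
               for a, b in zip(tokens, tokens[1:]))

for text in ("The White House denied the allegations",
             "White House sources denied the allegations",
             "a white house, quite pretty, with a red roof"):
    print(adjacent_capitalized(text, "white", "house"), "-", text)
```

Both capitalized sentences match with or without the article, and the
lowercase "white house" does not.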

>> (Trust me, this is one area I'm a real expert.)
>
>I'm out of touch with recent research in the field -- it's been over a
>decade since I did computational linguistics -- but my recent usage of
>full-text search engines, now so prevalent on the net, does not seem
>to show me any advance in the art.  Maybe there are new algorithms
>and data structures providing speed-ups on the search engine (some
>of the engines ARE remarkably fast, I will admit), but the *search
>quality* issues, which are connected with what to include in the
>tables, search-language semantics, etc, seem to be still all there
>(indeed, the search engine I find myself using most often [because it's
>so wondrously fast] is extremely poor this way -- can't even ask for
>adjacency or nearby-words, etc, etc; but "brute force wins again":-).

Yeah, one of the reasons I hate AltaVista so much is that it does a
remarkably poor job of result ranking, even by the relatively low
standards of search engines.

However, every time I see a serious attempt to improve automated search
results, it leads directly to the mucky swamp of AI and "natural
language", with degraded search results.  I remember typing something
like "cat" into one search engine and getting back documents containing
"cow" because both are animals (I don't remember the exact example).

(Note that I've never been a particular expert on academic research.
I've always been much more interested in deployable applications.)
--
                      --- Aahz (Copyright 2000 by aahz at netcom.com)

Androgynous poly kinky vanilla queer het    <*>     http://www.rahul.net/aahz/
Hugs and backrubs -- I break Rule 6

"Let's go home and turn on MTV.  I want to watch some Pop-Up Videos."
"Pop-Up Videos is not on MTV, it's on VH-1."
"'MTV' is a generic."


