Top 10 Language Constructs (Phyton)

Alex Martelli alex at magenta.com
Sat Jul 15 04:37:45 EDT 2000


"Aahz Maruch" <aahz at netcom.com> wrote in message
news:8knng5$eue$1 at slb3.atl.mindspring.net...
> In article <8kn71u02nn4 at news2.newsguy.com>,
> Alex Martelli <alex at magenta.com> wrote:
> >"Dinu C. Gherman" <gherman at darwin.in-berlin.de> wrote in message
> >news:396EE7BC.DA2FA7B3 at darwin.in-berlin.de...
> >>
> >> I doubt this will be very useful, like the answer to the
> >> question: "Which are the ten most used words in English?".
> >
> >That one is pretty useful, actually (it identifies the all-too-
> >frequent words, like 'the', 'of', etc, which a full-text search
> >engine had better just skip over:-).
>
> That's wrong.  How do you search for the phrase "the White House"?

You get all adjacencies of 'white' and 'house', assuming your
search tables are case-insensitive (which they should not be,
for umpteen reasons, but that's another thread, I guess).  This
only adds a modest number of false positives: since you were
going to get references to "the white house at the corner of
Elm and Main" anyway, it's no great hardship to also get the
occasional "a white house, quite pretty, with a red roof".

But meanwhile you're protecting yourself against losing many
relevant positives due to quirks of English syntax and idiom.
"White House sources denied the allegations" and "The White
House denied the allegations" are basically equivalent; "the",
like all such syntax-marker words, tends to appear or
disappear from phrases depending, unsurprisingly, on syntax
(and idiom) issues.
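
A minimal sketch of that trade-off, assuming a toy positional index
(the stop list, the sample documents and the function names are all
made up for illustration, not taken from any real engine).  Stop
words are left out of the tables and out of the query, and the
remaining words must be adjacent:

```python
import re
from collections import defaultdict

# Hypothetical stop list; a real engine would derive it from corpus frequencies.
STOP_WORDS = {"the", "a", "an", "of", "at", "and", "with"}


def tokenize(text):
    """Split text into alphabetic words, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)


def build_index(docs):
    """Map lowercased word -> {doc_id: [positions]}, skipping stop words.

    Positions count every token, so the tables stay case-insensitive
    and adjacency of the surviving words is preserved.
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(tokenize(text)):
            key = word.lower()
            if key not in STOP_WORDS:
                index[key][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return ids of docs where the non-stop words of phrase are adjacent."""
    words = [w.lower() for w in tokenize(phrase) if w.lower() not in STOP_WORDS]
    if not words:
        return []
    hits = []
    for doc_id, positions in index.get(words[0], {}).items():
        for start in positions:
            if all(start + k in index.get(w, {}).get(doc_id, [])
                   for k, w in enumerate(words)):
                hits.append(doc_id)
                break
    return sorted(hits)


DOCS = [
    "The White House denied the allegations",
    "White House sources denied the allegations",
    "the white house at the corner of Elm and Main",
    "a white house, quite pretty, with a red roof",
    "a house painted white",
]
INDEX = build_index(DOCS)
RESULT = phrase_search(INDEX, "the White House")
```

Here RESULT holds documents 0 through 3: both "allegations"
sentences match, with or without the article, at the price of the
two "white house" false positives.  Document 4 is excluded because
the words are not adjacent.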

And your tables shrink usefully.  You can invest the saved
space in other precious services that are often skimped on,
such as (guess what!) proper casing flags, flags for
inflection, etc.  So you can search for: {white constrain
initial-cap} adjacent {house constrain initial-cap, singular}
and get all "properly capitalized" references to the White
House, with or without the article.
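
One way such a constrained query might look, as a rough sketch only:
each posting carries an initial-cap flag and a singular flag next to
its position.  The Posting layout, the tiny INFLECTIONS table and the
adjacent() helper are assumptions invented for this example, and
stop-word handling is left out to keep it short:

```python
import re
from collections import defaultdict, namedtuple

# Hypothetical posting layout: position plus the two flags discussed above.
Posting = namedtuple("Posting", "pos initial_cap singular")

# Hypothetical toy inflection table: surface form -> (lemma, is_singular).
# Anything not listed is treated as a singular form of itself.
INFLECTIONS = {"houses": ("house", False)}


def build_index(docs):
    """Map lemma -> {doc_id: [Posting, ...]}, recording case and number."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(re.findall(r"[A-Za-z]+", text)):
            lower = word.lower()
            lemma, singular = INFLECTIONS.get(lower, (lower, True))
            index[lemma][doc_id].append(
                Posting(pos, word[0].isupper(), singular))
    return index


def adjacent(index, left, right):
    """left and right are (lemma, constraint) pairs.

    Return ids of docs where a posting satisfying left is immediately
    followed by a posting satisfying right.
    """
    lemma_l, ok_l = left
    lemma_r, ok_r = right
    hits = []
    for doc_id, posts_l in index.get(lemma_l, {}).items():
        next_positions = {p.pos
                          for p in index.get(lemma_r, {}).get(doc_id, [])
                          if ok_r(p)}
        if any(ok_l(p) and p.pos + 1 in next_positions for p in posts_l):
            hits.append(doc_id)
    return sorted(hits)


DOCS = [
    "The White House denied the allegations",
    "White House sources denied the allegations",
    "the white house at the corner of Elm and Main",
    "White houses line the street",
    "Two White Houses",
]
INDEX = build_index(DOCS)

# {white constrain initial-cap} adjacent {house constrain initial-cap, singular}
RESULT = adjacent(
    INDEX,
    ("white", lambda p: p.initial_cap),
    ("house", lambda p: p.initial_cap and p.singular),
)
```

RESULT is documents 0 and 1 only: the lowercase "white house" fails
the casing constraint, and both "houses" documents fail on the
singular flag.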


> (Trust me, this is one area I'm a real expert.)

I'm out of touch with recent research in the field -- it's
been over a decade since I did computational linguistics --
but my recent use of full-text search engines, now so
prevalent on the net, shows me no advance in the art.  Maybe
new algorithms and data structures are speeding up the
engines (some of them ARE remarkably fast, I will admit), but
the *search quality* issues, which depend on what to include
in the tables, search-language semantics, etc, all seem to
still be there.  Indeed, the search engine I find myself
using most often [because it's so wondrously fast] is
extremely poor this way -- it can't even handle adjacency or
nearby-words queries; but "brute force wins again":-).


Alex





