[XML-SIG] Useless fun thing for XML - comments or helpers?

Walter Underwood wunder@infoseek.com
Tue, 08 Sep 1998 10:27:52 -0700


At 08:38 AM 9/5/98 +0200, Lars Marius Garshol wrote:
>* Walter Underwood
>>
>> [...] Well, we do pay attention to one tag -- the first <title>
>>or <TITLE> tag is considered to be the title of the document for 
>>purposes of displaying search hits.
>
>Hmmm. Have you considered using architectural forms to give page authors
>more freedom, but still allow you to discover which elements are the
>equivalents of 'TITLE' and 'AUTHOR' etc?

The general form of our answer for feature requests is "if paying
customers want it, we'll look at it". Of course, we're providing XML
support even though we only have one customer asking for it (so far).

The Architectural Forms proposal looks interesting, and I actually
hope it catches on, since it could make our job easier. The search
engine only needs to know a little bit of info: basically, what is
content, what is meta-content, and what is formatting. Actual 
interpretation and display is the job of some other program. That
is why the search engine only needs well-formed XML, rather than
valid XML. But a *small* set of common base architectural forms
could allow the parser to sort out some of the basic data/metadata
elements.

Interestingly, this supports the earlier rule-of-thumb in the 
attribute vs. element discussion. If it is something that should
be searchable, represent it with an element.
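
To make that rule of thumb concrete, here is a minimal sketch in Python.
It assumes an indexer that collects words from element content only and
never looks at attribute values; the helper name searchable_words and the
sample documents are hypothetical, and this is not the Ultraseek code.

```python
import xml.etree.ElementTree as ET

def searchable_words(xml_text):
    # Hypothetical helper: collect lowercase words from element content only.
    # Attribute values are never visited, so they never become search terms.
    root = ET.fromstring(xml_text)
    return [w.lower() for chunk in root.itertext() for w in chunk.split()]

# The same fact, recorded once as an attribute and once as an element.
as_attribute = '<book author="Melville">Moby Dick</book>'
as_element = '<book><author>Melville</author> Moby Dick</book>'

print(searchable_words(as_attribute))
print(searchable_words(as_element))
```

Under that assumption, a query for "melville" can only match the second
document, because the author name is element content there.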

At 04:04 PM 9/5/98 -0700, Lisa Rein wrote:
>I am very curious how exactly XML is being utilized in the search engine
>if the only tag being taken into account is the (first) TITLE tag (just
>like a search engine would use during a "bag of words" approach) and not
>using a DTD -- making any semantic associations impossible.  
>
>If you're not going to deal with the text until after it's parsed, why
>are you using XML?  Are you doing some kind of indexing or another
>variation I haven't heard of?  Do tell ;-)

The goal is to make XML documents "findable" via web search. If we
treated them as raw text, the element names would show up in search
results and would swamp queries like "xml" or "doctype" with irrelevant
hits. Parsing the XML allows us to give quality results. Being independent
of the DTD allows us to handle the widest variety of documents. So far,
that looks like a "sweet spot" in XML support. DTD-specific search
can get very complex, very fast.
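
As a rough illustration of the difference, the sketch below indexes the
same document two ways: as raw text, and after parsing it as well-formed
XML with no DTD. It is only a sketch under stated assumptions: the
function names (raw_tokens, parsed_tokens, first_title) and the sample
document are made up for this example, and Python's standard
xml.etree.ElementTree parser stands in for whatever parser the product
actually uses.

```python
import re
import xml.etree.ElementTree as ET

def raw_tokens(text):
    # Naive approach: treat the document as plain text, markup included.
    return re.findall(r"[a-z]+", text.lower())

def parsed_tokens(text):
    # Parse as well-formed XML (no DTD consulted) and keep only character data.
    root = ET.fromstring(text)
    return [w.lower()
            for chunk in root.itertext()
            for w in re.findall(r"[A-Za-z]+", chunk)]

def first_title(text):
    # The first <title> or <TITLE> element supplies the display title.
    root = ET.fromstring(text)
    for elem in root.iter():
        if elem.tag in ("title", "TITLE"):
            return "".join(elem.itertext()).strip()
    return None

sample = "<report><title>Release notes</title><doctype>memo</doctype></report>"

print(raw_tokens(sample))     # element names appear among the tokens
print(parsed_tokens(sample))  # only the document's own words remain
print(first_title(sample))
```

With raw-text indexing, a query for "doctype" would hit this document
purely because of its markup; after parsing, only the words the author
actually wrote are indexed.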

Remember, the web server still serves the document. The search engine
only provides a URL to it. So the search engine just needs enough
info to serve a URL. Anything else gets in the way.

One clarification -- this feature is for the Ultraseek Server
product (http://software.infoseek.com), a search engine that people 
can buy and run locally. Ultraseek Server features are somewhat
independent of features for www.infoseek.com, the on-line search service. 

Finally, the XML market is very new, and this will be the first release
of our XML support. As the market matures, customers will tell us 
what they want and don't want, and we'll respond.

wunder


Walter R. Underwood
wunder@infoseek.com
wunder@best.com (home)
http://www.best.com/~wunder/
1-408-543-6946