Automatic linking


The previous chapter described the current Web interface, which includes links to Web search services in its display of author-title clusters. The search links are a first step towards automatically linking records in DIFWICS directly to copies of documents on the Web.

The general scheme for automatically linking a cluster to copies of the work available online is the same as the scheme for identifying the clusters in the bibliographic collection: a full-text search using a few arbitrarily selected words from the author and title fields turns up potential copies, and a more detailed comparison of those copies finds actual instances of the work.
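The two-stage scheme can be sketched as follows. This is a minimal illustration, not the DIFWICS implementation: the function names, the choice of the longest title words as query terms, and the 0.8 threshold are all assumptions made for the example.

```python
import re

def words(text):
    """Lowercase alphanumeric tokens of a string."""
    return re.findall(r"[a-z0-9]+", text.lower())

def candidate_pages(pages, author, title, n_terms=2):
    """Stage 1: cheap filter that keeps pages containing a few query words.

    Assumption: the query is the author's last name plus the n_terms
    longest title words; the source only says the words are arbitrary.
    """
    title_terms = sorted(set(words(title)), key=lambda w: (-len(w), w))[:n_terms]
    query = set(title_terms) | set(words(author)[-1:])
    return [p for p in pages if query <= set(words(p))]

def is_instance(page, author, title, threshold=0.8):
    """Stage 2: detailed comparison of one candidate against the record."""
    page_words = set(words(page))
    title_words = words(title)
    author_words = words(author)
    if not title_words or not author_words:
        return False
    hits = sum(1 for w in title_words if w in page_words)
    return author_words[-1] in page_words and hits / len(title_words) >= threshold
```

The first stage is deliberately loose so that it can run against a large index; only the pages it returns pay the cost of the stricter second comparison.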

Searching for related Web citations

The work reported here does not perform automatic linking, but it takes a first step in that direction, and the preliminary results offer some insight into a full-scale automatic linking project. The basic observation is that any document accessible from the World-Wide Web has a hypertext link to it, either from another page or through some search system; the hypertext link is a citation for the document, and often the hypertext link is included in a traditional citation that describes the document.

The Web catalog interface's links to other search services - to the reading room catalog, the AltaVista and Excite Web indices, and the NCSTRL and UCSTRI technical report indices - provide the first-step full-text search for online documents. The current system requires that the user perform the second filter manually on the search results, but the process can be automated.

The individual search links are created with some knowledge of the particular search interface and the format of citations on the Web. The AltaVista search engine makes efficient use of long quoted strings, so the search looks for occurrences of the full title. The reading room catalog interface does not catalog journal articles, but does catalog journals and uses the word "holdings" in each journal entry; for journal articles, the catalog search uses a few words from the journal name and the word "holdings" to check the journal's availability.
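A sketch of how such service-specific queries might be built is shown below. Only the query text is constructed; the function names, the stopword list, and the three-word limit are hypothetical, and no particular service URL or parameter format is assumed.

```python
import re

# Assumption: a small stopword list so that "a few words" from a journal
# name skips words such as "of" and "the".
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "with"}

def keywords(text, limit=None):
    """Content words of a string, in order, optionally truncated."""
    toks = [w for w in re.findall(r"[A-Za-z0-9]+", text)
            if w.lower() not in STOPWORDS]
    return toks if limit is None else toks[:limit]

def phrase_query(title):
    """Full title as one quoted phrase, for an engine that handles
    long quoted strings well."""
    return '"' + " ".join(title.split()) + '"'

def holdings_query(journal, n_words=3):
    """A few words from the journal name plus the word 'holdings',
    for a catalog that lists journals but not individual articles."""
    return " ".join(keywords(journal, n_words) + ["holdings"])
```

For example, `phrase_query("Obliq: A Language with Distributed Scope")` yields the title wrapped in double quotes, while `holdings_query` on a journal name yields its first few content words followed by `holdings`.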

A pair of examples illustrates some of the successes and pitfalls of this approach to automatic linking. The first example is a search for the paper "Obliq: A Language with Distributed Scope" by Luca Cardelli. The paper was issued as a DEC SRC technical report, and is available from the author's personal Web pages. The results of a search for this paper are unusually good, because SRC's technical report archive includes a separate page for each technical report, which matches the queries very closely.

Searches in AltaVista, Excite, UCSTRI, and the reading room catalog all return a link to the report in the SRC technical report archive as the best match for the search. (NCSTRL does not index SRC technical reports.) The two Web searches use relevance ranking to order all the Web pages that matched at least some of the query terms. Among the most relevant pages are:

The Web searches returned Cardelli's personal page with a link to a PostScript copy of the paper, but it is not ranked highly in the list of search results; it appears in about 30th position.

A search for the paper "A Theory of Primitive Objects: Second-Order Systems" by Martin Abadi and Cardelli produces more representative results, because it is not a SRC technical report. The search illustrates the benefits of the AltaVista full-string search over the Excite keyword-only search. The Excite search locates many pages that mention the authors and their work on type theory, but no pages with links to the desired paper. The AltaVista search, on the other hand, locates two pages maintained by the authors with links to the paper. These pages appear to be several levels deep on their local filesystem, so it is likely they are not included in the smaller Excite index. (The reading room link shows that it has a copy of the conference proceedings that include the paper.)
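The difference between the two kinds of search can be illustrated with a toy model. The functions below are hypothetical and do not reproduce either engine's actual matching or ranking; they only show why a keyword match accepts pages that a full-string match rejects.

```python
import re

def normalize(text):
    """Lowercase the text and reduce it to single-spaced alphanumeric tokens."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_match(page, query, min_fraction=0.5):
    """True if at least min_fraction of the query words occur anywhere on the page."""
    q = set(normalize(query).split())
    p = set(normalize(page).split())
    return bool(q) and len(q & p) / len(q) >= min_fraction

def phrase_match(page, phrase):
    """True only if the whole phrase occurs as a contiguous run of words."""
    return f" {normalize(phrase)} " in f" {normalize(page)} "
```

A page that discusses the authors' work on objects and type systems can satisfy `keyword_match` for the title without ever citing the paper, whereas `phrase_match` succeeds only where the full title appears.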

Principles for a fully automated system

These basic results suggest several things about how to design a system for automatically tracking down citations on other Web pages, instead of requiring the user to take the second step of examining individual Web pages.

1. Large-scale Web indexes are likely to index many pages that contain references to the paper being sought. These references will be a mix of normal citations, found in other papers, lecture notes, and other works, and of Web-based citations, like those found in personal publication lists. Some but not all of these citations will contain hypertext links to a digital copy of the document.

2. The citations that include hypertext links are often found on Web pages several levels deeper than a server's main Web page. Many of the smaller Web indexes omit these pages, so searches of these indexes are more likely to return no relevant pages, or pages that lead to the relevant citation but do not actually contain it. For example, the second Excite search described above returned a page about a book on objects written by Abadi and Cardelli, which in turn contained links to the authors' personal pages, which contained links to lists of publications.

3. Very few papers are available on the Web in an easily indexed format like HTML or plain text. Most papers are available as PostScript, which is not easily indexed. As a result, it is uncommon to discover pages where the title or search summary clearly indicates that the page contains the sought-after document.

We experimented with two services that augmented the current interface for searching the Web. One service performed searches automatically, retrieved the first 10 pages returned, and searched the pages for citations that were similar to the document being sought. Another service interposed a Web proxy server between the user and the Web; the proxy tracked the user's examination of the search results and recorded the page on which the paper was actually found. The proxy let the user navigate the Web as normal, but added a header to the top of each page that showed the author and title of the paper being sought; the proxy header also contained a link for the user to follow when the paper had been found. Neither of the experimental services was robust enough or successful enough to include in the current interface, but our brief use of them suggests that they would be interesting areas for future work.
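The first service might be approximated along the following lines. This is a sketch under stated assumptions, not the experimental code: the pages are assumed to be already retrieved and passed in rank order, the class and function names are hypothetical, and the heuristic (a page whose text covers most of the title words, with a link ending in a PostScript suffix) is deliberately crude.

```python
import re
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs and the page's text chunks."""
    def __init__(self):
        super().__init__()
        self.links = []    # completed (href, anchor_text) pairs
        self.text = []     # all text chunks seen on the page
        self._open = None  # [href, anchor_text] for the link being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = [href, ""]

    def handle_data(self, data):
        self.text.append(data)
        if self._open is not None:
            self._open[1] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            self.links.append(tuple(self._open))
            self._open = None

def tokens(s):
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def title_overlap(text, title):
    """Fraction of the title's words that occur in the text."""
    t = tokens(title)
    return len(t & tokens(text)) / len(t) if t else 0.0

def find_copies(pages, title, max_pages=10, threshold=0.8):
    """pages: (url, markup) pairs in rank order.
    Returns (page_url, href) pairs that may point at a copy of the paper."""
    found = []
    for url, markup in list(pages)[:max_pages]:
        parser = LinkCollector()
        parser.feed(markup)
        if title_overlap(" ".join(parser.text), title) < threshold:
            continue  # page does not appear to cite the paper
        for href, _anchor in parser.links:
            # Assumption: copies are PostScript files, possibly compressed.
            if href.lower().endswith((".ps", ".ps.gz", ".ps.z")):
                found.append((url, href))
    return found
```

A page that cites the paper but links to several PostScript files would yield several candidates, so a real system would still need to tie each link to the citation text nearest to it.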


Jeremy A Hylton
Mon Feb 19 15:33:12 1996