The underlying database and environment for storing and comparing bibliographic records, in particular the n-gram string comparison, were not the primary focus of this thesis. If the system is going to support a very large collection (millions of records) or many simultaneous users, its performance needs to be improved. The system also needs a few other implementation changes to allow long-term use in a production environment.
The prototype system was implemented primarily in Perl 5, which allowed rapid prototyping at the cost of execution-time efficiency. The n-gram string comparison is particularly slow in the current implementation; during the second round of cluster creation, loading records from disk and performing the detailed comparison proceeds at less than 10 records per second (on a 25 MHz RS/6000).
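As a rough illustration of what an n-gram string comparison involves, the following Python sketch scores two strings by the overlap of their character trigrams. The function names, the choice of n = 3, the normalization, and the use of the Dice coefficient are all assumptions made for illustration; this is not the prototype's Perl code, and the thesis may score n-grams differently.

```python
from collections import Counter

def ngrams(text, n=3):
    """Return a multiset of character n-grams for a normalized string."""
    # Lowercase and collapse whitespace so trivial formatting differences are ignored.
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))

def ngram_similarity(a, b, n=3):
    """Dice coefficient over n-gram multisets, between 0.0 and 1.0."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        # Both strings are shorter than n characters, so there is nothing to compare.
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    shared = sum((grams_a & grams_b).values())
    return 2.0 * shared / total
```

Even in this small form, the cost is visible: every comparison builds two n-gram tables, so precomputing and storing the table for each record is one obvious place to look for savings.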
One consequence of the implementation decisions is that it is difficult to identify the bottlenecks. The Perl implementation of, for example, n-gram comparisons is clearly slow, but implementing it in a different language might speed it up enough that some other part of the system becomes the bottleneck. The remaining comments on performance should be considered with this constraint in mind.
It appears that loading bibliographic records into the system is costly. The records are stored in their original text format and parsed each time a record is loaded. The primary cost of loading a record is parsing the BibTeX, so it may be profitable to develop an intermediate format that is easier to parse.
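One way to picture such an intermediate format is a parse-once cache: the original text stays authoritative, and a pre-parsed copy is stored beside it. The sketch below assumes JSON as the intermediate format and takes the parser as a parameter (`parse` is a hypothetical stand-in for whatever BibTeX parser the system uses); neither choice comes from the thesis.

```python
import json
import os

def load_record(path, parse):
    """Return the parsed record at `path`, reusing a JSON cache when it is current."""
    cache_path = path + ".json"
    # The cache is trusted only if it is at least as new as the source file.
    if os.path.exists(cache_path) and os.path.getmtime(cache_path) >= os.path.getmtime(path):
        with open(cache_path, encoding="utf-8") as cached:
            return json.load(cached)
    with open(path, encoding="utf-8") as source:
        record = parse(source.read())  # the expensive step, now done once per change
    with open(cache_path, "w", encoding="utf-8") as cached:
        json.dump(record, cached)
    return record
```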
Performing full-text queries during the construction of potential match pools appears to be the next most costly part of clustering, after the n-gram comparisons. In addition to optimizing the internal workings of the search engine, there may be an opportunity for global query optimizations. Three three-word queries are created for each record in the collection; it seems probable that the same query would be generated more than once, both because of duplicate records and because authors are likely to generate different works with some of the same title words. If performing a query is a significant bottleneck, the queries could be re-ordered to take advantage of repeated queries.
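Re-ordering is one option; a simpler variant of the same idea is to cache results so that a repeated query never reaches the search engine a second time. The sketch below is an illustration under assumptions: `run_query` is a hypothetical function standing in for the search engine, and word order within a query is assumed not to affect the result.

```python
def make_cached_search(run_query):
    """Wrap a search function so each distinct word set hits the engine only once."""
    cache = {}
    stats = {"hits": 0, "misses": 0}

    def search(words):
        # Normalize to a sorted, lowercased tuple so equivalent queries share a key.
        key = tuple(sorted(w.lower() for w in words))
        if key in cache:
            stats["hits"] += 1
        else:
            stats["misses"] += 1
            cache[key] = run_query(key)
        return cache[key]

    return search, stats
```

The hit and miss counts also give a cheap way to measure how often queries actually repeat before investing in a more elaborate optimization.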
One important limitation, independent of performance considerations, is that the system does not record the source of a bibliographic record. When a particular bibliography is integrated into the main collection, there is no way to record that it was originally part of, say, the USENIX bibliography. As a result, it is difficult to keep the collection up-to-date and incorporate changes and additions from a bibliography that has already been included.
The production system should record the source in a consistent way, so that the main collection can continuously incorporate changes from external sources. A ``data pump'' could be set up to monitor sources of bibliographic records and add new records or update modified records.
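A minimal sketch of the bookkeeping such a data pump would need, assuming each source assigns its records a stable local key (an assumption; the thesis does not specify how records would be identified across updates). The function name, the dictionary layout, and the use of a SHA-1 digest to detect changes are illustrative choices.

```python
import hashlib

def sync_source(collection, source_name, records):
    """Merge records from one source into `collection`, keyed by (source, record key).

    `records` maps a source-local key to the raw record text. Returns the keys
    that were newly added and the keys whose text changed since the last sync.
    """
    added, updated = [], []
    for key, text in records.items():
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        entry = collection.get((source_name, key))
        if entry is None:
            added.append(key)
        elif entry["digest"] != digest:
            updated.append(key)
        else:
            continue  # unchanged since the last sync
        collection[(source_name, key)] = {"text": text, "digest": digest}
    return added, updated
```

Because the source name is part of every key, the same structure answers the question the current system cannot: which bibliography a given record came from.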
Recording the source of a record enables other value-added services, such as judging the quality of a record based on its source, which are described below.
One approach to improving the quality of bibliographic information, the one described in this thesis, is to locate related bibliographic records and merge them in a way that improves the quality of information. A different approach is to use authority control.
In a library catalog, authority control describes the process of identifying each of the unique names in the catalog (usually names of authors and names of subject headings) and finding all of the variant forms of the name within the catalog. An authority record describes the authoritative form of the name along with any variants.
Authority records can be integrated into the catalog, but more often they are used by librarians to help in the preparation of the catalog. New entries in the catalog can be checked against the authority records to determine the proper form of the name, and old records can be updated to use the authoritative form.
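The lookup described above can be sketched as a table from variant forms to the authoritative form. The class name, the normalization rule (ignore case, punctuation, and spacing), and the example names in the usage note are assumptions for illustration, not part of any cataloging standard.

```python
class AuthorityFile:
    """Maps variant forms of a name to a single authoritative form."""

    def __init__(self):
        self._authoritative = {}  # normalized variant -> authoritative form

    @staticmethod
    def _normalize(name):
        # Ignore case, punctuation, and spacing differences when matching variants.
        return "".join(ch for ch in name.lower() if ch.isalnum())

    def add(self, authoritative, variants=()):
        """Register an authoritative form together with its known variants."""
        for form in (authoritative, *variants):
            self._authoritative[self._normalize(form)] = authoritative

    def lookup(self, name):
        """Return the authoritative form, or None if the name is not under authority control."""
        return self._authoritative.get(self._normalize(name))
```

For example, after `add("Communications of the ACM", ["CACM", "Commun. ACM"])`, a lookup of `"commun ACM"` returns the authoritative form, while an unknown name returns `None` and could be queued for review by a cataloger.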
Using authority control to regularize the use of certain fields, notably author, journal, and publisher, would improve the quality of the records visible to the user and, in the case of the author field, the quality of the clustering algorithm. (Recall that problems parsing and comparing author lists accounted for most of the missed matches during clustering.) A system for authority control, however, would have to deal with some of the same problems the clustering algorithm handles now; it needs to identify as many variant entries as possible without being so aggressive that truly different entries are conflated.
Authority control for journals, publishers, and conferences would not affect the creation of author-title clusters, but would make it easier to produce union records and would improve the value of the ``field consensus'' and ``source match'' ratios. Creating the authority records, however, would be a labor-intensive process, requiring a human cataloger to generate a list of authoritative names and review possible variations to determine if they in fact refer to the same object. It should be possible to automate much of the process by looking for plausible variations on and abbreviations of the authoritative name, but some variations would be virtually impossible to detect automatically: The Journal of Library Automation, for example, changed its name to Library Resources and Technical Services. On the other hand, it is possible that journals in different fields could be abbreviated the same way; possible conflicts in abbreviations should be reviewed by a human cataloger.
This observation about the need for human supervision of authority control applies to library cataloging in general. The identification of basic bibliographic information (the author and title of the work, the pages it appears on, etc.) is a largely clerical process. (Fully automated cataloging is an active area for research, but little progress has been made.) Instead, cataloging should focus on information that is more difficult to obtain: whether two authors with similar names are in fact the same person or whether two papers with similar but different titles actually represent the same work. Heaney makes the same case in his argument for an object-oriented cataloging standard.
The clustering algorithm identifies author-title clusters, in part because identifying equivalence and derivative bibliographic relationships has the most advantage for users and in part because they can be reliably identified in the presence of mixed-quality records. Identifying other bibliographic relationships would also be useful; if authority control (or some other mechanism) is used to improve the quality and consistency of field values, this problem would be easier to tackle.
The hierarchical relationship holds between a composite work and its parts: between a journal issue and the articles it contains or between a conference proceedings and the papers it contains. The wide variation in the journal and booktitle fields makes this relationship hard to identify in the current collection, but authority control could make comparisons possible. The relationship could be stored as journal issue clusters or proceedings clusters that contain all of the articles from a particular issue of a journal or all the papers presented at a conference.
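Once journal names are consistent, building journal issue clusters reduces to grouping records on a shared key. The sketch below assumes records are dictionaries with BibTeX-style `journal`, `volume`, and `number` fields and that the `journal` value has already been regularized by authority control; both are assumptions made for illustration.

```python
from collections import defaultdict

def issue_clusters(records):
    """Group article records by (journal, volume, number)."""
    clusters = defaultdict(list)
    for record in records:
        journal = record.get("journal")
        if not journal:
            continue  # not a journal article, or the journal is unknown
        key = (journal, record.get("volume"), record.get("number"))
        clusters[key].append(record)
    return dict(clusters)
```

Proceedings clusters could be built the same way by keying on a regularized `booktitle` and `year` instead.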
The hierarchical relationship would be a useful addition to the current search interface. When a user finds an interesting paper, he or she could examine the proceedings clusters to see if any similar work was presented or check the journal cluster for an accompanying article. Clusters could also identify the sequential relationship by linking together the journal issue or proceedings clusters, which would provide a three-level hierarchy for browsing; users could move between clusters for individual articles, clusters for issues, and clusters for entire journals.
The referential relationship is interesting because it cannot, in general, be identified using bibliographic information alone. References that involve critique or review, e.g. a Computing Reviews article, might be identifiable, but citations do not contain enough information to determine that one paper is cited by another paper.
The referential relationship could be identified if an information dossier contained information in addition to bibliographic records; in particular, if it contained the citation list or the entire text of the document. The information dossier is a particularly useful notion because it can include information of all sorts, such as the full text of the document or information on how to order it.
Non-bibliographic information can be included in a dossier if the author and title can be identified and matched with an existing author-title cluster. Extending the dossier allows a much richer set of interactions with the library collection. For example, a user browses a document and discovers a reference to another work that is potentially of interest. The user highlights the reference with his or her mouse and clicks a button, and the document that was referenced appears in a new window on the user's screen.
Including the full text of a document (including the citations) enables many other applications as well. Users can perform queries across the entire text of a document, which creates more opportunities for discovering relevant documents. Abstracts and summaries can be automatically generated for the documents, which can help the user quickly establish the relevance of a document to the current search. Citation indexes and graphs can be created that show how often and how widely a particular paper or conference proceedings is cited.
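A citation index of this kind is essentially an inverted citation list. The sketch below assumes each citation has already been resolved to the identifier of an author-title cluster, which is the hard part and is not shown; the function names and the input layout are illustrative assumptions.

```python
from collections import defaultdict

def citation_index(citation_lists):
    """Invert {citing work: [cited works]} into {cited work: set of citing works}."""
    cited_by = defaultdict(set)
    for citing, cited_works in citation_lists.items():
        for cited in cited_works:
            cited_by[cited].add(citing)
    return dict(cited_by)

def citation_counts(cited_by):
    """Number of distinct works citing each work, most cited first."""
    return sorted(((len(citing), work) for work, citing in cited_by.items()), reverse=True)
```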
One related issue that does not seem to be well-understood is machine processing of citations intended to be read by humans. It is difficult to design a general purpose processor that can identify the distinct parts of a citation; possible problems include identifying the individual authors' names and distinguishing between different numeric values, like years, page numbers, and volume/issue numbers. Some leverage on the problem can be gained by looking for bibliographic records that are ``similar'' to the citation, using the structured information contained in bibliographic records to try to understand the unstructured citation. Eytan Adar and I proposed one scheme for linking the two.
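To make the difficulty with numeric values concrete, here is a deliberately naive Python sketch that separates volume/issue, page range, and year using pattern shape alone. The patterns and the function name are assumptions for illustration, not the linking scheme mentioned above, and the heuristics fail on many real citations (a year range such as 1977-1993 would be read as pages, and a parenthesized year after a volume would be read as an issue number).

```python
import re

YEAR = re.compile(r"\b(?:19|20)\d{2}\b")
PAGES = re.compile(r"\b(\d+)\s*-{1,2}\s*(\d+)\b")
VOLUME_ISSUE = re.compile(r"\b(\d+)\s*\(\s*(\d+)\s*\)")

def numeric_fields(citation):
    """Guess volume/number, pages, and year from the shape of the numbers."""
    fields = {}
    rest = citation
    match = VOLUME_ISSUE.search(rest)
    if match:
        fields["volume"], fields["number"] = match.group(1), match.group(2)
        # Remove the matched span so its digits are not reused by later patterns.
        rest = rest[:match.start()] + " " + rest[match.end():]
    match = PAGES.search(rest)
    if match:
        fields["pages"] = (match.group(1), match.group(2))
        rest = rest[:match.start()] + " " + rest[match.end():]
    match = YEAR.search(rest)
    if match:
        fields["year"] = match.group(0)
    return fields
```

The failure cases are the point: shape alone is ambiguous, which is why comparing the citation against structured records that look similar is attractive.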
The automatic processes for identifying and merging bibliographic records work quite well in general, but human intervention would be helpful for correcting the errors that do occur. In general, the system should allow users to make corrections and changes to the bibliographic records and to the author-title clusters.
There are at least two different actions that a librarian might want to perform. First, the librarian should be able to change the contents of an author-title cluster by explicitly labelling a pair of records as related or not related. Marking two records as related would cause the author-title cluster to contain all records that have the same author and title fields as one of the two records.
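One possible way to combine librarian labels with automatically generated links is to treat both as edges in a graph and recompute clusters as connected components. The sketch below is an assumption about how this might work, not the system's design; the function and parameter names are hypothetical.

```python
def build_clusters(record_ids, automatic_links, related=(), not_related=()):
    """Connected components over automatic links, adjusted by librarian labels."""
    # A pair labelled "not related" removes the direct automatic link between them.
    blocked = {frozenset(pair) for pair in not_related}
    edges = [pair for pair in automatic_links if frozenset(pair) not in blocked]
    # A pair labelled "related" adds a link even if the algorithm found none.
    edges.extend(related)

    parent = {rid: rid for rid in record_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), set()).add(rid)
    return sorted(sorted(members) for members in clusters.values())
```

Note one limitation of this approach: removing a single link does not separate two records that remain connected through a third, so a real interface would need to show the librarian which other links hold a cluster together.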
Second, librarians should be able to label the quality of a source record or a particular collection of source records. Even a simple quality control scheme that allowed records to be marked as high quality, low quality, or mixed quality would improve the creation of composite records.
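As a sketch of how such labels could feed into composite records, the following Python function picks a field value by quality-weighted vote. The three labels come from the text above; the numeric weights, the `quality` key on each record, and the function name are assumptions chosen for illustration.

```python
from collections import defaultdict

# Illustrative weights only; a real system would need to tune these.
QUALITY_WEIGHT = {"high": 3, "mixed": 2, "low": 1}

def consensus_value(records, field):
    """Pick the value for `field` with the greatest total quality weight."""
    votes = defaultdict(int)
    for record in records:
        value = record.get(field)
        if value:
            # Unlabelled records are treated as mixed quality.
            votes[value] += QUALITY_WEIGHT[record.get("quality", "mixed")]
    if not votes:
        return None
    return max(votes, key=votes.get)
```

With these weights, a single high-quality record outvotes two low-quality records that agree with each other, which is the kind of behavior a simple majority rule cannot provide.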
The library collection would also benefit from other kinds of human interaction. The collection of bibliographic records and a system for managing information dossiers provide a basic infrastructure for supporting collaborative and cooperative work. An annotation service that allowed users to share reviews and critiques of documents is an example of such a service.
The two primary conclusions to draw from this work are that bibliographic relationships can be automatically identified in mixed-quality source records and that freely available bibliographic information can provide the basis for a useful and relatively complete index of the computer science literature.
The author-title clustering algorithm, described in Chapter 3, successfully identifies related bibliographic records that describe the same work. The algorithm tolerates errors in the records and variability in the cataloging practices, but maintains a tolerably low error rate; testing the algorithm with a small, controlled sample showed that it identified more than 90 percent of the related records and mistakenly linked records for two different works less than 1 time in 100. The effects of mistaken links are mitigated by presenting the user with information about the amount of variation in the underlying records.
The clustering algorithm uses a full-text index of the source records to limit the number of inter-record comparisons and to overcome errors in the author and title fields that have caused other algorithms to fail.
The Digital Index for Works in Computer Science demonstrates that it is possible to create a useful information discovery service from heterogeneous sources of bibliographic information. It uses the clustering algorithm to integrate records from many sources and in multiple formats without any more coordination between sources than now exists. The system can automatically incorporate records from other collections and from individual citation lists without requiring that the creators of those records change their current practice.
The 240,000-record DIFWICS collection is broad in scope: It covers a large part of the computer science literature, including most areas of speciality and a large percentage of the total literature cataloged by the ACM between 1977 and 1993.
The DIFWICS catalog identifies individual works, linking together duplicate records and different documents with the same author and title, and helps users find online documents. The work-centered catalog reduces redundancy in search results and makes inter-document relationships clearer, and the preliminary automatic linking work speeds the process of searching for papers on the Web and suggests that the process could be fully automated.