This chapter introduces the conceptual framework for creating, using, and relating bibliographic records. It also discusses some of the more practical issues associated with the specific records used in the experimental computer science library.
Bibliographic records have traditionally described specific physical objects or units of publication. This thesis describes a different use of bibliographic information and a different focus for cataloging: I use bibliographic records to describe a work instead of a particular document. The term work is used here to mean a unit of intellectual content, which may take on one or more published forms (or none at all); each published form is a document. A paper, presented at a conference and later revised and published in a journal or collected in a book, is a work, each version a different ``document.''
I emphasize the work over the document because I believe that the primary use of the library catalog is to find works. A patron looking for a copy of Hamlet is usually looking for the work Hamlet, and, depending on circumstance, any particular copy of Hamlet might do. The library catalog should help patrons identify the works available and then choose a particular document, based on the patron's needs.
The MIT library catalog, for example, returns a list of 17 items as the result of a title search for ``Hamlet.'' It takes careful review to determine that the list contains eight copies of Shakespeare's play, three recorded performances, one book about performances, three musical works inspired by the play, and two copies of Faulkner's novel titled The hamlet. (Copies that exist in various collected works do not appear in the search results.)
The Hamlet search illustrates the problem of cataloging particular documents but not the works they represent. Deciphering the search results took several minutes and required study of the source bibliographic records to determine exactly what each item was. The search results would be easy to understand if they were organized around works and relationships. The eight copies of the play are all versions of the same work, Hamlet, and the performances might be considered derivative works. In a music library, it might be useful to highlight the three distinct musical works, and note that each is related to the play.
This chapter discusses recent work in library science that suggests catalogers should focus on the work instead of the document. It presents a taxonomy of bibliographic relations, which helps to clarify the difference between a work and a document, and it discusses the experimental collection these theories will be applied to, a collection of 250,000 computer science citations.
Standards for cataloging and for using bibliographic information are a current subject of research in the library community. Two major themes run through several recent papers:
The most recent theoretical framework for descriptive cataloging was formulated in the 1950s by Seymour Lubetzky. According to Wilson, Lubetzky suggested that the library catalog serves two functions: the finding function and the collocation function.
Current cataloging practice places more emphasis on the first function than the second. This emphasis is strange, Wilson says, because most discussions of the theoretical background conclude that patrons are not interested in particular documents, so much as in the works they represent. The catalog standards result partly from the historical development of catalogs as simple shelf lists and partly from the ease of cataloging discrete physical objects, rather than works, which might constitute only part of an object or span several of them.
The emphasis on the physical object over the work is inadequate in several ways.
The user's interest in the work rather than the document has already been noted. Smiraglia and Leazer note that anecdotal evidence supports this claim and that catalog usage studies show that the bibliographic fields used to differentiate between variant editions are seldom used.
Library catalogs and other collections of bibliographic records are used increasingly as networked information discovery and retrieval tools, where discovering what kinds of works exist is more important than finding out which works are in the local library. When a user wants an item, there are many retrieval options other than the local library, including an Internet search and inter-library borrowing.
Trends in publishing make the bibliographic model increasingly unwieldy: Electronic publishing and advances in paper publishing technologies have made it easy to change and update documents. As a result, it is now common for each new printing of a book to incorporate some corrections and additions or for authors to publish electronic copies of their papers that include changes made after print publication. OCLC cataloging rules require that a new bibliographic record be created for each copy of a work that has a different date of impression and different text. Heaney cites a message from Bob Strauss to the Autocat mailing list that describes the problem; Strauss observes that between the original publication of Ed Krol's Whole Internet Catalog in 1992 and his search on Dec. 2, 1993, nine different versions had been cataloged in the OCLC union catalog (see Table 2-1).
Table 2-1: Versions of the Whole Internet Catalog cataloged in the OCLC union catalog
The problem raises two questions: First, is it sensible to create new records to describe each version of a document? Second, should the average user be exposed to this level of detail? The answer to the first question is unclear; the answer to the second, in many cases, may be no.
Tillett and others have developed taxonomies of bibliographic relationships. Tillett's taxonomy provides a useful vocabulary for discussing the different kinds of documents that describe the same work, as well as relationships between different works. The seven categories presented here are based on Tillett's taxonomy, although some of the categories are slightly different.
In an online environment, where a paper's citation list may be as accessible as standard bibliographic information, this expanded notion seems useful.
User queries are another example of a kind of shared characteristic relationship. The results of a query all share the particular characteristic described by the query.
Identifying and representing each of the relationships described in the previous section is beyond the scope of this thesis. Instead, I focus on identifying a limited set of relationships and presenting the rough form of a user-interface for those relationships.
The algorithm presented in the next chapter identifies related records based on the author and title fields; if the fields are the same, it concludes the records describe the same work. Two relationships hold between records in such a cluster: equivalence and derivative relationships; some records will be duplicate citations of the same document and others will cite different documents that represent the same work.
Identifying works is difficult. Even when a human cataloger is reviewing two bibliographic records, it can be difficult to tell if they describe the same work without referring to actual copies of the cited work. A cluster is an algorithmically-generated set of related records, which may or may not be the same as the actual set of records for a particular work. A cluster generated by one algorithm may be better in one way or another than a cluster generated by a different algorithm. (Indeed, the next chapter describes several algorithms that identify only duplicate records and do not consider works.)
This thesis uses author-title clusters, which identify a work as a unique author and title combination. Any pair of records with the same title and the same authors is considered equivalent for the purpose of creating an author-title cluster, although there may be unusual cases where this test does not discriminate between two different works.
Using the term equivalent requires some care; it sounds simple enough, but equivalence depends entirely on the context in which some equivalence test is applied. (Consider, for example, the four different equality tests in Common Lisp.) The records in an author-title cluster are equivalent for the purpose of identifying a work, but are probably not equivalent for the purpose of locating the work in a library or retrieving it across the network.
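The author-title test can be illustrated with a minimal sketch. It assumes records are already parsed into dictionaries with hypothetical "authors" and "title" keys, and it uses a simple normalization (lowercasing, dropping punctuation, ignoring author order) that is an assumption of this example, not the algorithm described in the next chapter.

```python
import re
from collections import defaultdict

def normalize(text):
    """Lowercase and reduce punctuation and whitespace to single spaces (assumed normalization)."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())

def author_title_key(record):
    """Build a cluster key from the set of author names and the title."""
    authors = frozenset(normalize(name) for name in record["authors"])
    return (authors, normalize(record["title"]))

def author_title_clusters(records):
    """Group records that share the same author-title key."""
    clusters = defaultdict(list)
    for record in records:
        clusters[author_title_key(record)].append(record)
    return list(clusters.values())

# Hypothetical records, invented for illustration only.
records = [
    {"type": "article", "authors": ["A. Author", "B. Writer"], "title": "A Sample Paper"},
    {"type": "techreport", "authors": ["B. Writer", "A. Author"], "title": "A sample paper."},
    {"type": "article", "authors": ["C. Other"], "title": "A Different Paper"},
]
clusters = author_title_clusters(records)
```

In this sketch the first two records fall into one cluster even though they cite different documents, which is the intended behavior for identifying a work.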
The two uses of equivalent above isolate two separate problems that must be addressed in a catalog that is work-centered, but constructed from bibliographic records that have not been prepared with this use in mind. The first problem is identifying the works described by the records in the catalog. The second problem is identifying the separate documents in the author-title cluster and presenting them as different instances of the main work. This process involves identifying the different documents within the cluster and merging duplicate citations for each document into a single, composite record.
The derivative relation holds between the different documents in an author-title cluster, but identifying each different document is complicated by many factors, including the version problem and the difficulty of relying on Bibtex for finely-nuanced descriptions. I solve a simplified version of the problem by using the Bibtex entry type to identify the different, horizontal classes of records within a cluster. Thus, a cluster might be presented to the user as containing two document types (an article and a technical report) but would not distinguish between, say, different editions of a book.
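As a rough illustration of this simplification, the sketch below splits one cluster by entry type. The dictionary layout and the "type" key are assumptions made for the example; the sample records are hypothetical.

```python
from collections import defaultdict

def split_by_entry_type(cluster):
    """Partition the records of one cluster by their Bibtex entry type."""
    groups = defaultdict(list)
    for record in cluster:
        groups[record["type"].lower()].append(record)
    return dict(groups)

# Hypothetical cluster: three citations of the same work.
cluster = [
    {"type": "Article", "title": "A Sample Paper", "year": "1993"},
    {"type": "TechReport", "title": "A Sample Paper", "year": "1992"},
    {"type": "techreport", "title": "A sample paper", "year": "1992"},
]
documents = split_by_entry_type(cluster)
```

Here the cluster would be shown as two document types, with the two technical report citations treated as duplicates to be merged; two editions of a book would both land under "book" and would not be told apart.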
Bibliographic formats affect how well relationships and works can be identified. The format not only dictates what information can be recorded, but also tends to affect the practice of recording.
Many freely available bibliographic records use the Bibtex format, which is used to produce citation lists in the LaTeX document preparation system. About 90 percent of the records in DIFWICS are Bibtex records.
Being able to accept bibliographic records in any format is a design goal of DIFWICS, because it minimizes the need for coordination among publishers, catalogers, and libraries and maximizes the number of records available for immediate use.
Because other bibliographic formats, like Refer or Tib, are less common than Bibtex, the current implementation supports only one other bibliographic format, the CS-TR format developed as part of the Computer Science Technical Report Project (CS-TR) and defined by RFC 1807.
The two formats differ in both syntax and semantics, so using both formats interchangeably requires a common format that both can be converted into. The common format involves some information loss, when one format captures more information about a particular field than the other format is capable of expressing. For example, the CS-TR format has separate fields for authors and corporate authors, but Bibtex has only a single author field for both kinds of author; the common format does not capture the distinction, because it is not possible to determine which kind of author is being referred to in Bibtex.
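The author example can be made concrete with a small sketch. It assumes a CS-TR record has been parsed into a dictionary keyed by the field names AUTHOR and CORP-AUTHOR, each holding a list of strings; that representation, and the function name, are assumptions of this example.

```python
def merge_authors(cstr_record):
    """Combine personal and corporate authors into one list.

    The result has a single author list, so the personal/corporate
    distinction is lost, as described above.
    """
    personal = list(cstr_record.get("AUTHOR", []))
    corporate = list(cstr_record.get("CORP-AUTHOR", []))
    return personal + corporate

# Hypothetical record used only for illustration.
record = {"AUTHOR": ["Doe, Jane"], "CORP-AUTHOR": ["Example University"]}
common_authors = merge_authors(record)
```

Once merged, nothing in the common record indicates which entries were corporate authors, so the conversion cannot be reversed.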
Several characteristics of the Bibtex and CS-TR formats affect the design of the library and the kinds of bibliographic relationships that can be identified. Bibtex is currently used as the common format, because Bibtex is capable of describing any document that can be described with a CS-TR record (albeit with some loss of information).
Bibtex files are used for organizing citations and preparing bibliographies. The format is organized around several different entry types (see Table 2-2), which describe how a document was published. Each type uses several of the two dozen standard fields to describe the publication.
Table 2-2: Bibtex document types
The format is very flexible. There are few rules governing precisely how a field must be formatted and users are encouraged to define their own fields as necessary.
The individual fields fall into three categories (required, optional, and ignored) depending on the document type; the journal field is required for an Article, but ignored for a TechReport. Ignored fields allow users to define their own fields. Throughout this thesis, I call the required and optional fields the standard fields and the ignored fields non-standard.
Most of the specific fields are easy to use, process, and understand (month, year, or journal, for example), but a few fields contain unstructured information about the document being cited, notably note, annote, abstract, and keywords. In practice, the note, annote, and (non-standard) keywords fields often appear to be confused; the note field is intended for miscellaneous information to print with a citation and the annote field for comments about the cited document, such as would be included in an annotated bibliography.
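The three categories can be sketched as a lookup keyed by entry type. The table below covers only two entry types, with field lists taken from the standard Bibtex styles as I understand them; the table layout and function name are assumptions of this example.

```python
# Required and optional fields for two entry types (standard Bibtex styles).
FIELD_RULES = {
    "article": {
        "required": {"author", "title", "journal", "year"},
        "optional": {"volume", "number", "pages", "month", "note"},
    },
    "techreport": {
        "required": {"author", "title", "institution", "year"},
        "optional": {"type", "number", "address", "month", "note"},
    },
}

def classify_field(entry_type, field):
    """Return 'required', 'optional', or 'ignored' for a field of an entry type."""
    rules = FIELD_RULES[entry_type.lower()]
    name = field.lower()
    if name in rules["required"]:
        return "required"
    if name in rules["optional"]:
        return "optional"
    # Anything else is ignored by the standard styles (non-standard here).
    return "ignored"
```

Under this table, classify_field("Article", "journal") is "required" while classify_field("TechReport", "journal") is "ignored", matching the example in the text.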
The CS-TR format was designed specifically for universities and R&D organizations to exchange information about technical reports. Its fields are geared towards describing technical reports; this allows a more detailed description of technical reports than the standard Bibtex fields do, but limits the format's usefulness for describing other documents.
CS-TR defines a few mandatory fields used for record management and 25 other fields, all of which are optional. Some of the fields can be easily converted to Bibtex: CS-TR date maps to Bibtex month and year (with loss of the day) and CS-TR title is the same as Bibtex title. Most of the CS-TR fields have no analogue in the standard Bibtex fields and must be omitted or placed in a non-standard field or the note field.
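The date mapping might look like the following sketch. It assumes the date value is written as "Month Day, Year" or "Month Year"; that input format, the fallback to the note field, and the function name are assumptions of this example, not a statement of what the CS-TR definition requires.

```python
import re

# Assumed input forms: "March 15, 1995" or "March 1995".
DATE_PATTERN = re.compile(r"^\s*([A-Za-z]+)\s+(?:\d{1,2}\s*,\s*)?(\d{4})\s*$")

def cstr_date_to_bibtex(date_value):
    """Map a CS-TR date string to Bibtex month and year, dropping the day."""
    match = DATE_PATTERN.match(date_value)
    if match is None:
        # Unrecognized dates are kept verbatim in the note field.
        return {"note": date_value}
    month, year = match.groups()
    return {"month": month, "year": year}
```

Both "March 15, 1995" and "March 1995" yield the same month and year fields, which shows the loss of the day mentioned above.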
MARC is the predominant bibliographic format in the library community. While it is not used by the system presented here, the MARC record makes an interesting point of comparison.
The MARC record is a highly structured format; its use emphasizes precise labels for fields and detailed descriptions of the items being cataloged. Crawford  provides an overview of MARC and its use in libraries; he observes that all MARC records share five characteristics:
The MARC format defines several hundred fields, many of which have subfields, that specify the format and content of the field values exactingly. The primary field used for author (field number 100, "Main Entry-Personal Name") has subfield codes for specifying the personal name, titles or dates associated with the name, and fuller forms of the name; another code indicates whether the personal name begins with a forename, single surname, or multiple surnames.
MARC's precision makes comparing records more difficult for several reasons. There is more opportunity for small errors in MARC; several different fields can be used to enter the same information; and there is some flexibility as to how much information must be entered. Users of electronic library catalogs will probably be familiar with the problem of determining when two author entries are the same: separate listings appear when one record has a date of birth, while another has dates of birth and death and a third contains a fuller form of the author's name.
The practical implication of the differences between MARC records and citation-oriented records like Bibtex is that, while Bibtex records are not as rich in information, they provide a much simpler structure for extracting information, like author and title, which are used to distinguish between different works.
Bibtex's flexibility allows people to enter information in many ways. The use of abbreviations in field values is very common, which makes it difficult to compare two fields to see if they have the same value; ignoring capitalization, Communications of the ACM is abbreviated variously as ``CACM'', ``C. ACM'', ``C.A.C.M.'', ``Comm. ACM'', ``Comm. of the ACM (CACM)'', etc.
Notes about the document, e.g. that it is an abstract only or that it is a revised edition, are entered in many different ways. Although the note field seems to be the most likely candidate for this information, it is variously entered in the title field (complicating comparisons), in the note or annote field, or in the edition field (where ``2nd'' is as likely as ``second'').
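One way to compare such values is to map known variants to a canonical name before comparing. The sketch below is only an illustration: the hand-built variant table and the punctuation-stripping step are assumptions of this example, and the table covers just the variants listed above.

```python
import re

def squash(value):
    """Lowercase a field value and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", value.lower())

# Hand-built table of known variants (illustrative only).
VARIANTS = [
    "Communications of the ACM",
    "CACM",
    "C. ACM",
    "C.A.C.M.",
    "Comm. ACM",
    "Comm. of the ACM (CACM)",
]
CANONICAL_JOURNALS = {squash(v): "Communications of the ACM" for v in VARIANTS}

def canonical_journal(value):
    """Return the canonical journal name, or the value unchanged if unknown."""
    return CANONICAL_JOURNALS.get(squash(value), value)
```

A table like this only helps for abbreviations someone has already seen; an unknown value is returned unchanged and would still fail an exact comparison.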
There appears to be less variation among the different sources of CS-TR records, because the definition of each field is fairly specific and because technical reports have fewer unusual cases than other document types.
The problems of abbreviations and variations are less pronounced in CS-TR records because they are produced by the publishing institutions, which tend to be consistent within their own records. CS-TR records are also easier to handle because many of the fields are unused in the records available today; more than half the records use no more than seven descriptive fields.
DIFWICS incorporates bibliographic records from two major collections available on the Internet. The primary source is Alf-Christian Achilles' collection of 450,000 Bibtex records, titled ``A Collection of Computer Science Bibliographies''. This work organizes several hundred individual collections of varying size and quality. The second source is the CS-TR records produced by the five participants in the CS-TR project; the collection is composed of approximately 6,000 technical report records from Berkeley, Carnegie Mellon, Cornell, M.I.T., and Stanford.
The first collection requires some explanation to understand what kinds of records it provides and how they affect the system for identifying related records. The individual bibliographies fall into three major categories: personal bibliographies organized by individual researchers, journal and conference proceedings bibliographies, and bibliographies organized around a particular subject. Although Bibtex is commonly used to prepare citation lists for papers, none of the source files suggest they were prepared for that purpose.
There are only a few personal bibliographies, but each is quite large and appears to have been created and checked with some care. Joel Seiferas's collection holds 43,000 theory citations and Gio Wiederhold's collection holds 10,000 citations, mostly about databases.
The journal and conference bibliographies tend to be fairly complete listings of the articles or papers published. The bibliography for the Journal of the ACM, for example, includes every article published from 1954 to 1995.
The topical collections range widely from a 5,000-record collection on programming languages and compilers to a 24-record collection on fuzzy Petri nets.
Achilles has organized the collection into major subject areas, and we have selected an arbitrary subset of the records in each category to include in DIFWICS (about 240,000 records in all); the remainder of the collection had not been processed at the time of this writing. A brief description of each category and the number of records included from it follows. (Sizes are rounded to the nearest 5,000.)
The time period spanned by the DIFWICS collection, the size of the records, and the number of fields they contain also provide some useful measures of the collection.
The year field gives a fairly precise measure of when documents in the collection were published, although some 17,000 records (about 5 percent of the collection) contain year fields where the actual year cannot be recognized automatically.
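A minimal sketch of automatic year recognition follows. The rule used here (accept the field only when it contains exactly one distinct four-digit year from 1800 to 2099) and the function name are assumptions of this example, not the procedure actually used to produce the counts above.

```python
import re

# Four-digit years from 1800 to 2099, not embedded in a longer number.
YEAR_PATTERN = re.compile(r"(?<!\d)(1[89]\d{2}|20\d{2})(?!\d)")

def recognize_year(year_field):
    """Return the year as an int, or None if it cannot be recognized."""
    matches = set(YEAR_PATTERN.findall(year_field))
    if len(matches) == 1:
        return int(matches.pop())
    # No year, or several different years: treat as unrecognized.
    return None
```

Values such as "to appear", a two-digit "91", or a range like "1991--1992" return None under this rule and would count among the unrecognized records.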
Figure 2-1: Number of citations per year in bibliography collection and ACM Guide to Computing Literature
Half the records included in the computer science library cite documents published between 1988 and 1993, with a peak in 1991. The number of citations grows exponentially from 124 citations for documents published in 1956, to 1212 citations for 1969, and 10402 citations in 1985. Figure 2-1 compares the DIFWICS collection to the number of documents cited in the annual ACM Guide to Computing Literature.
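The growth claim can be checked with a short calculation on the three counts quoted above. The sketch assumes a simple exponential model between each pair of years; the function names are hypothetical.

```python
import math

# Citation counts quoted in the text.
counts = {1956: 124, 1969: 1212, 1985: 10402}

def annual_growth_rate(year_a, year_b):
    """Continuous growth rate per year between two of the quoted counts."""
    return math.log(counts[year_b] / counts[year_a]) / (year_b - year_a)

def doubling_time(year_a, year_b):
    """Years needed for the citation count to double at that growth rate."""
    return math.log(2) / annual_growth_rate(year_a, year_b)

overall = doubling_time(1956, 1985)
early = doubling_time(1956, 1969)
late = doubling_time(1969, 1985)
```

Under this model the count doubles roughly every four to five years across the whole period, somewhat faster before 1969 than after, which is consistent with describing the growth as exponential.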
The number of citations drops sharply after 1991, with 12,527 cites for 1994 and only 5,382 for 1995. It is not clear what the cause of the drop is. Two plausible hypotheses are that Bibtex is becoming less popular for preparing citation lists or that there is a delay of a year or so between the time a paper is published and the time it accumulates enough citations to appear in a collection. (Achilles's collection includes the most recent copies of the individual bibliographies, with little apparent delay.)
Figure 2-2: Distribution of records by size
The source records range in size from 50 bytes to 10,000 bytes, but more than 90 percent of them are between 200 and 1,000 bytes long, with an average length of 397 bytes. (See Figure 2-2.)
The Bibtex records have, on average, entries in 6.7 different fields, and 90 percent of the records use between four and nine fields. (See Figure 2-3.)
Figure 2-3: Distribution of Bibtex records, arranged by the number of fields used