Identifying and Merging Related Bibliographic Records

by Jeremy A. Hylton

Master of Engineering Thesis
MIT Department of Electrical Engineering and Computer Science
Submitted February 13, 1996

Thesis Supervisor: Prof. Jerome H. Saltzer

Abstract

Bibliographic records freely available on the Internet can be used to construct a high-quality, digital finding aid that provides the ability to discover paper and electronic documents. The key challenge to providing such a service is integrating mixed-quality bibliographic records, coming from multiple sources and in multiple formats. This thesis describes an algorithm that automatically identifies records that refer to the same work and clusters them together; the algorithm clusters records for which both author and title match. It tolerates errors and cataloging variations within the records by using a full-text search engine and an n-gram-based approximate string matching algorithm to build the clusters. The algorithm identifies more than 90 percent of the related records and includes incorrect records in less than 1 percent of the clusters. It has been used to construct a 250,000-record collection of the computer science literature. This thesis also presents preliminary work on automatic linking between bibliographic records and copies of documents available on the Internet.
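
As an illustration of the idea in the abstract, the following is a minimal sketch of n-gram-based approximate matching used to cluster records whose author and title both match. It is not the thesis's implementation: the character trigrams, the Dice coefficient, the 0.8 threshold, the greedy comparison against each cluster's first record, and the names `ngrams`, `similarity`, and `cluster` are all assumptions made for this example. The full-text search engine that the thesis uses to build the clusters is left out entirely.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of a normalized string."""
    # Normalize: lowercase, turn punctuation into spaces, collapse whitespace.
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    cleaned = " ".join(cleaned.split())
    if len(cleaned) < n:
        # Strings shorter than n contribute themselves as a single gram.
        return {cleaned} if cleaned else set()
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def similarity(a, b, n=3):
    """Dice coefficient over n-gram sets (an assumed measure, not the thesis's)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def cluster(records, threshold=0.8):
    """Greedy clustering: a record joins the first cluster whose first
    record approximately matches it on BOTH author and title."""
    clusters = []
    for rec in records:
        for group in clusters:
            rep = group[0]
            if (similarity(rec["author"], rep["author"]) >= threshold
                    and similarity(rec["title"], rep["title"]) >= threshold):
                group.append(rec)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([rec])
    return clusters
```

Each record here is assumed to be a dictionary with `author` and `title` keys. Because the comparison works on overlapping character sequences rather than whole words, a transposed letter or a dropped middle initial changes only a few n-grams and leaves the similarity score high, which is how this kind of matching tolerates typographical errors and cataloging variations.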

Availability

  • PostScript and gzipped PostScript.
  • Text version. (Converted using LaTeX2HTML; this one doesn't have the figures.)
  • LCS Technical Report MIT/LCS/TR-678. gzipped PostScript.

  • last updated 4/16/96 by jeremy
    links updated 11/00 by jeremy