Jeremy Hylton : weblog : 2003-11-17

Grouping Bibliographic Records

Monday, November 17, 2003

Jon Udell points to a D-Lib article about grouping bibliographic records into works and expressions. The problem is that you search for Hamlet and get many results back -- different editions, different movies, a copy within a collected works, etc. They all show up as different entries even though at some abstract level they are the same. How can we make effective search interfaces that exploit those relationships?

This was the subject of my master's thesis, Identifying and Merging Related Bibliographic Records. I looked at a narrower problem: identifying duplicates and related items among computer science publications.

The article by Thom Hickey, chief scientist at OCLC, and Edward O'Neill and Jenny Toves reports on work undertaken in response to the Functional Requirements for Bibliographic Records report. The article says:

The most innovative part of the report dealt with the first group of entities, describing the hierarchical relationships that cluster bibliographic items into manifestations, expressions and works.

At the end of the article, they note that it's implemented in Python and uses Twisted. Good for them! My thesis work was in Perl, the first and last substantial project I wrote in Perl.

They have a fairly straightforward approach that relies on high-quality bibliographic records and authority files. Step One in creating a "work set" is to use the normalized primary author and title. I was working with records in BibTeX and RFC 1357 formats, where errors in author and title were routine. In the absence of record or title authorities, I used n-grams to find approximate matches. One of my conclusions was:

Using authority control to regularize the use of certain fields, notably author, journal, and publisher, would improve the quality of the records visible to the user and, in the case of the author field, the quality of the clustering algorithm. (Recall that problems parsing and comparing author lists accounted for most of the missed matches during clustering.) A system for authority control, however, would have to deal with some of the same problems the clustering algorithm handles now; it needs to identify as many variant entries as possible without being so aggressive that truly different entries are conflated.
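To give a feel for the n-gram matching mentioned above, here is a minimal sketch in Python. It is not the code from my thesis (that was Perl); the character trigrams, the Jaccard overlap measure, the function names, and the 0.6 threshold are all assumptions chosen for illustration.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams in a lightly normalized string."""
    # Lowercase, turn punctuation into spaces, and collapse runs of whitespace.
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    cleaned = " ".join(cleaned.split())
    if len(cleaned) < n:
        # Very short strings are treated as a single gram (or none if empty).
        return {cleaned} if cleaned else set()
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def similarity(a, b, n=3):
    """Jaccard overlap of the two strings' n-gram sets, from 0.0 to 1.0."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def probably_same(a, b, threshold=0.6):
    """Hypothetical cutoff; a real system would tune this against known matches."""
    return similarity(a, b) >= threshold
```

A title with a dropped letter, say "Bibliographc" for "Bibliographic", still shares most of its trigrams with the correct one, so the pair scores high, while unrelated titles share few or none. The threshold is where the tension in the quote above shows up: set it too low and truly different entries get conflated, too high and variant entries are missed.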

Hickey's article describes a much nicer user interface than I actually implemented. In particular, they address how to compare work sets.